I. BACKGROUND: THE UNITARY ORBIT OF A BIPARTITE QUANTUM STATE

Let $A$ and $B$ be two quantum systems, with states of $A$ represented in the $m$-dimensional Hilbert space $\mathcal{H}_A = \mathbb{C}^m$ and
those of $B$ in $\mathcal{H}_B = \mathbb{C}^n$. We assume that $m, n \geq 2$. Let $\rho = \rho_{AB}$ be any state of the joint system $\mathcal{H}_A \otimes \mathcal{H}_B$, with (real)
eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{mn}$ summing to 1. Quantum measurements performed on the local subsystems will in general
reveal correlations between the two states $\rho_A = \operatorname{Tr}_B \rho_{AB}$ and $\rho_B = \operatorname{Tr}_A \rho_{AB}$, some of which may be attributable
to entanglement but others of which could be recreated classically in some sense by preparing the joint system in a
probabilistic mixture of known product states, that is to say, in a separable state. Furthermore there is a very small
discrete subset of these separable states known as the classical states: those representable by diagonal matrices in the joint
computational basis. The question of the extent to which any correlations are "genuinely quantum" is key to the
resource-based theories of quantum information and quantum computation currently being developed. Indeed there
are also many questions in thermodynamics (see for example [12], [13] and the references contained therein) which
arise from viewing correlation as a resource in nanotechnological applications, where correlations between local states
in quantum superpositions are demonstrably more powerful than those for classical states.

There is a compelling question in the middle however - what about the correlations of separable states, which are 
generally non-classical but also not entangled? To make this distinction, we may speak of separable correlations giving 
the level of correlation inside joint states which are convex sums of product states (but which will not in general be 
classical), which usually display a higher degree of correlation than the purely classical states associated to the same 
spectrum. Similarly, classical correlations will refer to correlations within classical states. 

Given a fixed state $\rho$ as above, the set of quantum states with identical spectrum are precisely those obtained from
$\rho$ via unitary transformations. Indeed, we may act upon our state $\rho$ via transformations from the unitary group
$\mathcal{U}(mn)$ of degree $mn$, generating the unitary orbit $\mathcal{O}_\rho$ containing all quantum states which are reversibly obtainable
from our starting state $\rho$. The key thing to note is that in traversing a generic unitary orbit we pass through points
with only classical or separable correlations and then through a much larger set of inseparable quantum states whose
entanglement is linked to the possibility of yet higher correlations. In other words, even though the spectrum remains
constant, we nevertheless create and destroy correlations in the course of traversing the orbit $\mathcal{O}_\rho$: a statement which
highlights the dependence of the notions we are discussing upon the basis in which we have chosen to represent the
states.

So within $\mathcal{O}_\rho$ we have a natural hierarchy of states with the potential for non-zero correlations, within which we
would expect the classical states to be the "lowest" in some sense, and the pure entangled states to be the "highest".

This setup is depicted schematically in figure 1. The large oval represents the convex set of all states; the inner
circle at the bottom is the convex set of separable states, and the entangled states make up the remainder, as in a
Venn diagram. Within this the orbit $\mathcal{O}_\rho$ of a single state is depicted as the boundary of an oval. Note that it is not
in itself a convex set; however the convex hull of $\mathcal{O}_\rho$ consists precisely of all of the unitary orbits of spectra which are
majorised by the spectrum of $\rho$: in particular the maximally mixed state $\frac{1}{mn} I$ (which is a unitary orbit consisting
of just one point). Quantities which we shall refer to below are noted in the diagram. In particular we must point
out that the mutual information (see below) scale on the right is very much a schematic one: it must be read
only in the context of a particular unitary orbit $\mathcal{O}_\rho$ as shown. Otherwise, for example, the maximally mixed state at
the barycentre of the space would always lie "above" the minimally correlated state, which is clearly not the case in
general.

FIG. 1. Unitary orbit $\mathcal{O}_\rho$ of the state $\rho$ within the space of all states. The mutual information is confined to a
range of values over $\mathcal{O}_\rho$ as shown. Labels in the diagram: maximally entangled states (QMI $= 2 \log d$); pure product
states (QMI $= 0$), at the juncture between the separable states and the pure states.

What is an appropriate measure for this scale? The standard both in quantum and classical information theory
is mutual information, which loosely speaking measures the distance between a joint state and the product of its
reduced subsystems. For any joint system as above we define the quantum mutual information (QMI) to be

$$I(\rho_{AB}) = S(\rho_A) + S(\rho_B) - S(\rho_{AB}),$$

where for any state $\sigma$ with spectrum $(l_1, \ldots, l_N)$ we denote by $S(\sigma) = \operatorname{Tr}(-\sigma \log \sigma)$ its von Neumann entropy.
Observe that for a classical state the matrix of $\sigma$ will be diagonal in the computational basis and so the definition of
von Neumann entropy reduces to the Shannon entropy

$$H((l_1, \ldots, l_N)) = \sum_{i=1}^{N} -l_i \log l_i$$

of the probability vector $(l_1, l_2, \ldots, l_N)$. Indeed the definition of QMI then reduces to that of classical mutual
information (CMI). However mutual information is not perfect for our purposes, because it does not in any way distinguish
between classical and quantum correlations. Indeed it is quite common to have an entangled state with lower mutual
information than a classical state: hence the current attempts to define measures which separate quantum
correlations from classical ones, such as quantum "discord" and quantum "dissonance". So to get a handle on what sorts of
tradeoffs can occur between these quantum and classical correlations, a good starting point is to be able to delimit
the maxima and minima of mutual information for each class (i.e. classical, separable or entangled) of states within a
particular orbit. We should also highlight recent work by Partovi [3], in which he develops a neat, general framework
capturing the notion of "disorder" in terms of majorisation theory: hence a stronger classification than is provided
by entropic measures for example, yielding fewer relations. This approach is in a sense an orthogonal one to ours in
that, instead of working with a fixed spectrum and moving over the unitary orbit as we have done below, the marginal
spectra are fixed, with the total spectrum allowed to vary, revealing what constraints that places upon the possible
minimally disordered states, be they classical, separable or entangled.
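As a rough numerical illustration of the remark that an entangled state can carry less mutual information than a classical one, the following sketch compares two two-qubit states. It is not taken from the paper. It assumes natural logarithms and relies on two standard facts about Bell-diagonal two-qubit states: their marginals are maximally mixed, so their QMI is $2\log 2$ minus the Shannon entropy of the spectrum, and they are entangled exactly when the largest eigenvalue exceeds $\frac{1}{2}$. The function names and the chosen spectra are purely illustrative.

```python
import math

def shannon(probs):
    """Shannon entropy (natural log) of a probability vector, with 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cmi_classical(joint):
    """Mutual information of a classical state given as a joint probability matrix."""
    rows = [sum(r) for r in joint]
    cols = [sum(col) for col in zip(*joint)]
    flat = [p for r in joint for p in r]
    return shannon(rows) + shannon(cols) - shannon(flat)

def qmi_bell_diagonal(spectrum):
    """QMI of a two-qubit Bell-diagonal state (assumption: both marginals are I/2)."""
    return 2 * math.log(2) - shannon(spectrum)

# Classical state (|00><00| + |11><11|)/2: two perfectly correlated bits.
classical = [[0.5, 0.0], [0.0, 0.5]]
# Bell-diagonal state with largest eigenvalue 0.6 > 1/2, hence entangled.
entangled_spectrum = [0.6, 0.4 / 3, 0.4 / 3, 0.4 / 3]

i_classical = cmi_classical(classical)
i_entangled = qmi_bell_diagonal(entangled_spectrum)
print("classical CMI :", i_classical)
print("entangled QMI :", i_entangled)
```

The two states lie on different unitary orbits, so this only illustrates that mutual information alone does not separate the classes; it says nothing about the ordering within a single orbit.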

It turns out [13] that the minimal mutual information within $\mathcal{O}_\rho$ can always be realised on a classical state, hence a
fortiori a separable state. Since the classical states form a discrete set it is in principle a straightforward problem to
find the minimum (in general it will occur on one of a relatively small suite of permutations identifiable by a simple
test, see [12], but outside the case $(m, n) = (2, 2)$ there is no particular configuration which will be the minimum in
all cases). Another way to view this is to remark that, being quite a coarse measure, the mutual information is unable
to tell us anything about the nature of the state (i.e. whether it be classical, separable or entangled) for any joint
states with sufficiently low correlations.

Remark. It is important to note that the states we refer to as "unique" are only unique up to local unitary operations 
(and if m = n also the "transpose" operation of swapping the two systems A and B). Hence we shall tend to refer to 
unique classes of states rather than unique states. 

So we then ask about the maxima. When $m = n$ the maximum overall mutual information occurs for a maximally
entangled state [13]; indeed there will in general be an infinite number of such maximally entangled states yielding
the maximal QMI. The cases where $m \neq n$ are messier but the answer is similar. As with the minimum, a set of
conditions can be laid down such that the maximum CMI must occur for a class of states which is one of a relatively
small set of candidates; however outside the cases $(m, n) = (2, 2)$ and $(2, 3)$ this maximum configuration is non-unique.
Indeed numerical studies indicate that for $(2, 4)$ there are 2 classes which both occur as maxima for different spectra,
for $(2, 5)$ there are six and for $(3, 3)$ there are 18. For comparison we mention that in the case of the minimal CMI,
in $(2, 2)$ the class is unique (as is the maximum), in $(2, 3)$ there are exactly 5 possibilities, and then numerical results
indicate that in $(2, 4)$ there are 14, in $(2, 5)$ there are 42 and in $(3, 3)$ there are 18.

Little is known about the maximum separable state along a general unitary orbit. The only known way to access 
it is via convex optimisation using the Peres-Horodecki positive partial transpose criterion pQ, which outside of the 
cases (m, n) = (2, 2) and (2, 3) is only a necessary condition for separability and so does not really assure us of a 
result anyway. In section II we first look at a special class of the $(2, 2)$ states where we can pin down the maximal
separable state, and do some calculations to illustrate its behaviour as contrasted with the maximally and minimally
correlated classical states on the same orbit. This is only achievable because of the neat framework laid out by
R. and M. Horodecki [11] for understanding the unitary orbit of a two-qubit state. In the $(2, 3)$ case we do not have
such a framework, and it is consequently much more difficult to understand the big picture. 

Section III contains the main result of this paper (theorem 1) where we show that in the case where $(m, n) =
(2, 3)$, the maximal CMI occurs always (uniquely, up to an action by 12 CMI-invariant transformations) at the state
represented by a diagonal matrix containing a fixed ordering of the eigenvalues $\lambda_1, \ldots, \lambda_6$. Curiously this fixed ordering
is the same for every spectrum, irrespective of the relative sizes of the eigenvalues. As we mentioned above, whereas
this is also true for $(m, n) = (2, 2)$, it is not true for larger joint systems like $(2, 4)$ or $(3, 3)$.
II. SEPARABLE VERSUS CLASSICAL CORRELATIONS FOR THE TWO QUBIT CASE

We restrict for a moment to the case (m, n) = (2, 2). As we shall see, non-trivial features arise even in the simplest 
possible setting. 

So we have a joint system $\mathcal{H}_A \otimes \mathcal{H}_B$ of two qubits in a given state $\rho = \rho_{AB}$ with spectrum $\{a, b, c, d\}$ satisfying
$a, b, c, d \geq 0$ and $a + b + c + d = 1$. There is a representation [11] of states of such a system in terms of the Pauli
matrices, which gives two local reduced Bloch vectors $\mathbf{r}_A$, $\mathbf{r}_B$ at $A$ and $B$, together with a 3-by-3 real "correlation
matrix" $T = (t_{ij})$, giving a total of $3 + 3 + 9 = 15$ real variables parametrising exactly the action of $SU(4)$ on $\rho_{AB}$.
Furthermore if we restrict to what they refer to in [11] as the T-states, namely those states with maximally mixed
reductions (hence trivial Bloch vectors but maximal contributions each of $\log 2$ to the mutual information) at $A$ and
$B$, then by local changes of basis we may arrange that $T$ is in fact diagonal and so we are reduced to looking in these
specific instances at just three real variables $t_{11}$, $t_{22}$ and $t_{33}$. Now a natural choice of spanning set for the T-states is
the standard Bell basis

$$|\Phi^+\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle + |11\rangle), \quad |\Phi^-\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle - |11\rangle), \quad |\Psi^+\rangle = \tfrac{1}{\sqrt{2}}(|01\rangle + |10\rangle) \quad \text{and} \quad |\Psi^-\rangle = \tfrac{1}{\sqrt{2}}(|01\rangle - |10\rangle).$$

Then from the constraints that $\operatorname{Tr} \rho_{AB} = 1$ and that $\rho_{AB}$ be a positive matrix we obtain a tetrahedron $\mathcal{T}$ of T-states
with vertices $|\Phi^+\rangle$, $|\Phi^-\rangle$, $|\Psi^+\rangle$, $|\Psi^-\rangle$.

These diagonal matrices may be represented by what they call a t-vector $t = (t_{11}, t_{22}, t_{33})$: the Bell basis elements
correspond respectively to the t-vectors $(1, -1, 1)$, $(-1, 1, 1)$, $(1, 1, -1)$ and $(-1, -1, -1)$. In this framework there is
a natural way to choose a maximal QMI state [13] for the given spectrum $\{a, b, c, d\}$ of $\rho_{AB}$: namely, it is the state

$$\rho_{\max,\mathrm{QMI}} = a|\Phi^+\rangle\langle\Phi^+| + b|\Phi^-\rangle\langle\Phi^-| + c|\Psi^+\rangle\langle\Psi^+| + d|\Psi^-\rangle\langle\Psi^-|,$$

whose t-vector is

$$t_{\max,\mathrm{QMI}} = (a - b + c - d,\; -a + b + c - d,\; a + b - c - d).$$

The QMI of this state is

$$I_{\max,\mathrm{QMI}}(a, b, c) = 2 \log 2 - H((a, b, c, d)), \tag{1}$$

which is maximal over $\mathcal{O}_\rho$. Note also that since $d = 1 - a - b - c$ we shall view all of these quantities as functions
on $\mathbb{R}^3$ rather than $\mathbb{R}^4$. The translation from the representation of the states with maximally mixed reductions in the
T-state picture, back to the eigenvalue-picture is as follows: given a T-state vector $t = (u, v, w)$ with zero local Bloch
vectors the corresponding spectrum is

$$(a, b, c) = \left( \frac{1 + u - v + w}{4},\; \frac{1 - u + v + w}{4},\; \frac{1 + u + v - w}{4} \right).$$

The reason that the T-state setup is so useful for our purposes is that the separable states with maximally mixed
reduced states and diagonal T-matrix turn out to be exactly those states whose eigenvalues are all less than or equal
to $\frac{1}{2}$. These states trace out an octahedron $\mathcal{O}$ inside $\mathcal{T}$ which is given in the T-coordinate system by $\mathcal{O} = \mathcal{T} \cap -\mathcal{T}$.
Its vertices are $(\pm 1, 0, 0)$, $(0, \pm 1, 0)$, $(0, 0, \pm 1)$.
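The translation between spectra and t-vectors, and the octahedron criterion for separability, can be sketched numerically as follows. This is only an illustration: the function names are hypothetical, the fourth eigenvalue is obtained from $d = 1 - a - b - c$, and membership of the octahedron with vertices $(\pm 1, 0, 0)$, $(0, \pm 1, 0)$, $(0, 0, \pm 1)$ is tested via the equivalent condition $|u| + |v| + |w| \leq 1$.

```python
def spectrum_to_t(a, b, c, d):
    """t-vector of the Bell-diagonal state with weights a, b, c, d on
    Phi+, Phi-, Psi+, Psi- respectively."""
    return (a - b + c - d, -a + b + c - d, a + b - c - d)

def t_to_spectrum(u, v, w):
    """Inverse map from a T-state vector (u, v, w) back to the spectrum."""
    return (
        (1 + u - v + w) / 4,
        (1 - u + v + w) / 4,
        (1 + u + v - w) / 4,
        (1 - u - v - w) / 4,  # d = 1 - a - b - c
    )

def in_octahedron(t, tol=1e-12):
    """True if t lies in the octahedron with vertices (+-1,0,0), (0,+-1,0), (0,0,+-1)."""
    return sum(abs(x) for x in t) <= 1 + tol

separable_spectrum = (0.4, 0.3, 0.2, 0.1)  # all eigenvalues <= 1/2
entangled_spectrum = (0.7, 0.1, 0.1, 0.1)  # largest eigenvalue > 1/2

for spectrum in (separable_spectrum, entangled_spectrum):
    t = spectrum_to_t(*spectrum)
    print(spectrum, "->", t, "in octahedron:", in_octahedron(t))
```

For both example spectra the octahedron test agrees with the eigenvalue criterion stated above (all eigenvalues at most $\frac{1}{2}$).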

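Since we are using the joint computational basis and writing things in terms of Pauli matrices, any classical state
on the orbit $\mathcal{O}_\rho$ may be written

$$\rho^{\tau}_{\mathrm{class}} = \tau(a)|00\rangle\langle 00| + \tau(b)|01\rangle\langle 01| + \tau(c)|10\rangle\langle 10| + \tau(d)|11\rangle\langle 11|$$

for some $\tau \in S_4$, the symmetric group on four letters. (Here we have arbitrarily allocated the identity element of $S_4$
to the state where the eigenvalues $a, b, c, d$ are arranged in alphabetical order down the diagonal, which we denote by
$\operatorname{diag}(a, b, c, d)$.) The local Bloch vectors $\mathbf{r}^{\tau}_A$, $\mathbf{r}^{\tau}_B$ of $\rho^{\tau}_{\mathrm{class}}$ are no longer zero in general but rather

$$\mathbf{r}^{\tau}_A = (0, 0, a + b - c - d), \quad \mathbf{r}^{\tau}_B = (0, 0, a - b + c - d), \quad \text{with t-vector } t^{\tau} = (0, 0, a - b - c + d),$$

written here for the identity ordering; for a general $\tau$ each eigenvalue is replaced by its image under $\tau$.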

We assume from now on that $a \geq b \geq c \geq d \geq 0$. We know from [13] (or see appendix B) that under these conditions
the state

$$\rho_{\min} = \operatorname{diag}(a, b, c, d)$$

will give us the minimal QMI on $\mathcal{O}_\rho$:

$$I_{\min}(a, b, c) = h(a + b) + h(a + c) - H((a, b, c, d)), \tag{2}$$

and that the state

$$\rho_{\max,\mathrm{class}} = \operatorname{diag}(a, d, c, b)$$

corresponding to the permutation $\tau = (2, 4)$ will give the maximal CMI on $\mathcal{O}_\rho$:

$$I_{\max,\mathrm{class}}(a, b, c) = h(a + c) + h(b + c) - H((a, b, c, d)). \tag{3}$$

We have used the standard convention in (2) and (3) that $h$ is the binary entropy function

$$h(x) = -x \log x - (1 - x) \log(1 - x).$$

Finally we define $I_{\max,\mathrm{sep}}$ to be the maximal QMI attainable on a separable state $\rho_{\max,\mathrm{sep}}$ in the orbit $\mathcal{O}_\rho$. Note
that in general these maximal and minimal states will not be unique, whereas the value of the information is
uniquely defined.
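The closed forms (2) and (3) can be checked against a brute-force search over all 24 diagonal arrangements of a given spectrum. The sketch below does this for one illustrative spectrum; it assumes natural logarithms, and the helper names are hypothetical rather than taken from the paper.

```python
import itertools
import math

def h(x):
    """Binary entropy (natural log), with 0 log 0 taken as 0."""
    return sum(-p * math.log(p) for p in (x, 1 - x) if p > 0)

def shannon(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def cmi_two_qubits(diagonal):
    """CMI of the classical state diag(p00, p01, p10, p11)."""
    p00, p01, p10, p11 = diagonal
    # Marginal of A is (p00 + p01, p10 + p11); marginal of B is (p00 + p10, p01 + p11).
    return h(p00 + p01) + h(p00 + p10) - shannon(diagonal)

a, b, c, d = 0.4, 0.3, 0.2, 0.1  # illustrative spectrum with a >= b >= c >= d
spectrum = (a, b, c, d)

brute = [cmi_two_qubits(p) for p in itertools.permutations(spectrum)]
i_min = h(a + b) + h(a + c) - shannon(spectrum)        # equation (2)
i_max_class = h(a + c) + h(b + c) - shannon(spectrum)  # equation (3)
i_max_qmi = 2 * math.log(2) - shannon(spectrum)        # equation (1)

print("brute-force min/max CMI:", min(brute), max(brute))
print("closed forms (2), (3)  :", i_min, i_max_class)
print("maximal QMI (1)        :", i_max_qmi)
```

A single spectrum is of course not a proof; the check merely confirms that the stated arrangements are extremal in this instance.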

We wish to analyse the behaviour of the functions $I_{\max,\mathrm{QMI}}$, $I_{\max,\mathrm{sep}}$, $I_{\min}$ and $I_{\max,\mathrm{class}}$ as we roam over $\mathcal{O}_\rho$ for
some fixed spectrum $\{a, b, c, d\}$. Whereas $\rho_{\max,\mathrm{sep}}$ is difficult to find for a generic state, for illustrative purposes we
may restrict to the subset of spectra for which $\rho_{\max,\mathrm{QMI}}$ lies inside the octahedron $\mathcal{O}$ of separable states, namely those
with eigenvalues all less than or equal to $\frac{1}{2}$, for then we are guaranteed that $\rho_{\max,\mathrm{sep}}$ will coincide with $\rho_{\max,\mathrm{QMI}}$.
Thus by construction,

$$I_{\max,\mathrm{QMI}} = I_{\max,\mathrm{sep}} \tag{4}$$

for all of the states we shall be considering in this section. Define "gap" functions $\gamma_{\max}$ and $\gamma_{\min}$ as the differences
between the quantity in (1), and those in (3) and (2) respectively:

$$\gamma_{\max}(a, b, c) = 2 \log 2 - h(a + c) - h(b + c)$$

and

$$\gamma_{\min}(a, b, c) = 2 \log 2 - h(a + b) - h(a + c).$$

These represent the gaps in mutual information as we travel over different spectra, between the maximal QMI states
and their maximal and minimal counterparts in the classical subset. Indeed $\gamma_{\max}$ is a signature function for the
non-classicality of the state space: the states we are considering are not entangled; nevertheless they are able to manifest
greater mutual information than would a purely classical state with the same spectrum.

For the avoidance of confusion we should point out that by definition, $\gamma_{\max} \leq \gamma_{\min}$.

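The functions $\gamma_{\max}$ and $\gamma_{\min}$ are defined on the domain of spectra:

$$\mathcal{D}_{\mathrm{id}} = \{(a, b, c) \in \mathbb{R}^3 : \tfrac{1}{2} \geq a \geq b \geq c \geq (1 - a - b - c) \geq 0\},$$

which is a kind of pyramid with an irregular quadrilateral base. Its five vertices are at the points $V_1 = (\frac{1}{2}, \frac{1}{2}, 0)$
(which is the apex of the pyramid), $V_2 = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4})$, $V_3 = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, $V_4 = (\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ and $V_5 = (\frac{1}{2}, \frac{1}{6}, \frac{1}{6})$.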

Note that for each rearrangement $\tau \in S_4$ of the positions of the eigenvalues we obtain another domain $\mathcal{D}_\tau$: in total
these 24 domains glue together to form an octahedron which is a linear image of the regular octahedron $\mathcal{O}$ (see the
diagrams below). As we cross from one fundamental domain into another the functions $\gamma_{\max}$ and $\gamma_{\min}$ will need to
be re-defined in order to take into account the new ordering of the eigenvalues.

We restrict our attention therefore to the behaviour of $\gamma_{\max}$, $\gamma_{\min}$ on the convex region $\mathcal{D}_{\mathrm{id}}$. Now $-h$ is a
convex function on its domain the unit interval $[0, 1]$ and since the maps from $\mathbb{R}^3$ to $\mathbb{R}$ given by $(a, b, c) \mapsto (a + b)$,
$(a, b, c) \mapsto (a + c)$ and $(a, b, c) \mapsto (b + c)$ are all linear it follows that $\gamma_{\max}$ and $\gamma_{\min}$ are also convex on $\mathcal{D}_{\mathrm{id}}$. (Note that
these "gap" functions will in fact be convex on the whole octahedral domain; however one needs always to rearrange
the arguments in the definitions as remarked above.) Hence $\gamma_{\max}$ and $\gamma_{\min}$ will attain their maximal values on an
extremal point of $\mathcal{D}_{\mathrm{id}}$, which means one or more of the vertices $V_1, V_2, V_3, V_4, V_5$ above. By direct calculation we find
that the maximum of $\gamma_{\max}$ is $\frac{3}{4} \log 3 - \log 2$ (around 0.1308 in the natural logarithm) and it occurs at the point $V_4$.
That is to say, this is the largest possible deviation of mutual information (from the classical values) once one is
allowed the full scope of the quantum state space for these particular spectra.
For $\gamma_{\min}$ the maximum is $\log 2$ and it occurs at the point $V_1$. (Hence upon acting by $S_4$ we see that the maximal
points for $\gamma_{\min}$ are actually the six vertices of the octahedron.)

We can show directly from the definitions that the minimal values of these functions are always zero:

$$\gamma_{\max}(a, b, c) = 0 \quad \text{if and only if} \quad a = b = \tfrac{1}{2} - c = \tfrac{1}{2} - d;$$

and

$$\gamma_{\min}(a, b, c) = 0 \quad \text{if and only if} \quad a = b = c = d = \tfrac{1}{4}.$$

Hence $\gamma_{\max}$ is zero on the line joining $V_1$ and $V_2$; while $\gamma_{\min}$ is zero only at the point $V_2$ (which represents the
maximally mixed state).
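The vertex values of the two gap functions can be tabulated directly. In the sketch below the five vertex coordinates are those obtained by solving the defining inequalities $\frac{1}{2} \geq a \geq b \geq c \geq d \geq 0$ of the domain (the labels $V_3$ and $V_5$ are an assumption about which label goes with which point); natural logarithms are used and the function names are illustrative.

```python
import math

def h(x):
    """Binary entropy (natural log), with 0 log 0 taken as 0."""
    return sum(-p * math.log(p) for p in (x, 1 - x) if p > 0)

def gamma_max(a, b, c):
    """Gap between the maximal QMI and the maximal classical CMI."""
    return 2 * math.log(2) - h(a + c) - h(b + c)

def gamma_min(a, b, c):
    """Gap between the maximal QMI and the minimal classical CMI."""
    return 2 * math.log(2) - h(a + b) - h(a + c)

# Vertices of the ordered domain 1/2 >= a >= b >= c >= d >= 0 (labels assumed).
vertices = {
    "V1": (1 / 2, 1 / 2, 0.0),
    "V2": (1 / 4, 1 / 4, 1 / 4),
    "V3": (1 / 3, 1 / 3, 1 / 3),
    "V4": (1 / 2, 1 / 4, 1 / 4),
    "V5": (1 / 2, 1 / 6, 1 / 6),
}

table = {name: (gamma_max(*v), gamma_min(*v)) for name, v in vertices.items()}
for name, (gmax, gmin) in table.items():
    print(f"{name}: gamma_max = {gmax:.4f}, gamma_min = {gmin:.4f}")

best_for_gamma_max = max(table, key=lambda name: table[name][0])
best_for_gamma_min = max(table, key=lambda name: table[name][1])
```

Because both functions are convex on the domain, comparing the five vertex values is enough to locate their maxima there.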

On the next few pages we include some depictions of the behaviour of these two functions $\gamma_{\max}$, $\gamma_{\min}$ on the
whole octahedron $\mathcal{O}$, using the translation above from $(a, b, c)$-space to the T-state space. The first picture shows
the splitting of the octahedron into the (image of the) fundamental regions $\mathcal{D}_\tau$. Notice that the diagrams are all in
"t-vector space" and so care should be taken, when thinking about probability distributions in terms of the spectra
$\{a, b, c, d\}$, to use formulae like those given in the first part of this section in order to pass from the spectrum to the
octahedron and vice-versa.

It is worth making a few comments on these results. Firstly, the set of spectra, and their corresponding unitary 
orbits in state space, do not coincide with those unitary orbits lying entirely within the set of separable states. Such 
orbits correspond to the absolutely separable states [3] - namely those quantum states for which it is impossible to 
unitarily generate entanglement. Indeed, it has been shown E] that absolutely separable states have spectra that 
obey $a \leq c + 2\sqrt{bd}$, and so are found to be a proper subset of the spectra that we consider. The implication of this
is that orbits exist that contain entangled states, but attain their maximal mutual information on separable states. 
This can be seen more explicitly by considering the so-called maximally entangled mixed states (MEMS), being those 
states for which it is impossible to unitarily increase a given measure of entanglement . 

For the case of two qubits, the MEMS have been found to take the form

$$\rho_{\mathrm{MEMS}} = a|\psi^-\rangle\langle\psi^-| + b|00\rangle\langle 00| + c|\psi^+\rangle\langle\psi^+| + d|11\rangle\langle 11| \tag{5}$$

modulo local unitaries, and have maximal concurrence $C_{\mathrm{MEMS}} = \max(0, a - c - 2\sqrt{bd})$. Indeed, for a fixed spectrum, the
state $\rho_{\mathrm{MEMS}}$ not only maximizes concurrence, but also maximizes the negativity, the relative entropy of entanglement
and the entanglement of formation [B]. It is immediately clear from (5) that while $\rho_{\mathrm{MEMS}}$ might maximize entanglement,
it generally does not have maximally mixed marginals, and cannot be a maximum of the QMI, which is somewhat
surprising. Therefore, in the generic case of mixed quantum states there exist competing mechanisms between quantum
and classical correlations over the orbit of the state, which stands in contrast with the case of pure quantum states,
for which all correlation measures (quantum, classical and total) are simultaneously maximized on the maximally
entangled states.
FIG. 3. A special case of the first diagram: one fundamental region $\mathcal{D}_{\mathrm{id}}$ embedded in the octahedron $\mathcal{O}$, all embedded inside
the green tetrahedron of T-states whose green dot vertices are the Bell states. The vertices of the octahedron are the classically
correlated states like $\frac{1}{2}(|00\rangle\langle 00| + |11\rangle\langle 11|)$ which all have CMI $= \log 2$.
FIG. 4. The octahedron $\mathcal{O}$ showing some sections of contours of the function $\gamma_{\max}$ together with the fundamental regions. $\gamma_{\max}$
measures the gap between maximal separable and maximal classical correlations for a fixed spectrum; each spectrum is here
represented by a single point in $\mathcal{O}$.

FIG. 5. Alternative view of the function $\gamma_{\max}$ with more contours. The colour scheme in both these diagrams goes from blue
(low) to red (high): the maximal points are the midpoints of the octahedron's 12 edges; whereas the centre of the cylindrical
regions will be the blue lines shown in the fundamental regions above on which $\gamma_{\max} = 0$.

FIG. 6. The octahedron $\mathcal{O}$ showing the contours of the function $\gamma_{\min}$, which measures the full range of separable correlations
attainable over a fixed spectrum. Broadly speaking the function increases with distance from the minimum value of zero at the
barycentre of the octahedron: its extrema occur at the vertices of $\mathcal{O}$. The intermediate contours are a kind of truncated cube
or cuboctahedron.



III. AN ENTROPIC BINARY RELATION FOR 2 × 3 SYSTEMS AND THE UNIQUE MAXIMAL CLASSICAL STATE

So we have seen that a $2 \times 2$ system admits a straightforward analysis of the "classical gap" $I_{\max,\mathrm{class}} - I_{\min,\mathrm{class}}$
and that for many spectra (namely those where the eigenvalues are all $\leq \frac{1}{2}$), the "separable-classical gap" $\gamma_{\max}$ is also
fairly easy to calculate. However already for the $2 \times 3$ case it becomes relatively non-trivial even to determine the
classical gap. Indeed we make this the focus of this last section and discover a curious property of this $2 \times 3$ world:
namely that $I_{\max,\mathrm{class}}$ is determined solely by an ordering of the eigenvalues and not by their relative sizes. It seems
that there is just enough information to pin down the maximum (but again rather surprisingly not the minimum [12]);
but increasing either dimension renders this impossible.

We revert for a moment to the general setting of the introduction, in order to fix some ideas. So let $A$ and $B$ be two
quantum systems with states of $A$ represented in $m$-dimensional Hilbert space $\mathcal{H}_A = \mathbb{C}^m$ and those of $B$ in $\mathcal{H}_B = \mathbb{C}^n$.
Let $\rho_{AB}$ be any state of the joint system $\mathcal{H}_A \otimes \mathcal{H}_B$, with (real) eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{mn}$. We may consider the
classical state lying in the unitary orbit of $\rho_{AB}$, which may be viewed purely as a diagonal matrix of probabilities
summing to 1:

$$\operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{mn}). \tag{6}$$

As we observed above, by implementing unitaries whose effect is simply to send classical states to classical states,
that is, to permute the eigenvalues, we arrive at a series of different possibilities for the subsystems $\rho_A = \operatorname{Tr}_B \rho_{AB}$
and $\rho_B = \operatorname{Tr}_A \rho_{AB}$ obtained by taking the respective partial traces of the joint system. In this setting QMI reduces
to CMI: a function which is purely defined in terms of partial sums of the eigenvalues $\{\lambda_k\}$.

Now suppose that we are given an ordering on the set of eigenvalues $\lambda_1 > \lambda_2 > \ldots > \lambda_{mn}$, but no further
information on their relative sizes. We may then ask the question: is there a particular arrangement of these which
is guaranteed a priori to yield the minimal or maximal values of CMI associated with this entire class of diagonal
matrices?

As we mentioned in the first section, the answer in the simplest interesting case $(m, n) = (2, 2)$ is that the maximum
and the minimum may both be found by considerations of majorisation [12], as there are only 3 distinct equivalence
classes of matrices under the CMI map. In the next simplest case $(m, n) = (2, 3)$ the maximum is determined a priori
and further there are 5 "minimal" matrices [13], one of which will be the minimum in any given instance (and indeed
all of which do occur in specific examples, meaning that a priori the set of minima cannot be whittled down any
further without more stipulations on the relative sizes of eigenvalues). Not surprisingly, beyond these low-dimensional
instances nothing terribly definitive can be said because the relative gaps between successive eigenvalues come to play
too great a role. Indeed, the real surprise is that a definite maximum occurs in the $2 \times 3$ case. This means that the
maximal CMI has an easy a priori determination in the one-qubit-by-one-qutrit context. The remainder of this paper
is concerned with exploring the structure of this maximal CMI and establishing this unique maximum configuration.

So let $m = 2$, $n = 3$. Throughout this paper when speaking about the $2 \times 3$ case we shall fix our set of six
eigenvalues of the quantum state $\rho = \rho_{AB}$ as $\{a, b, c, d, e, f\}$ with $a + b + c + d + e + f = 1$ and assume that
$a > b > c > d > e > f > 0$ (we shall usually treat these as though they were strict inequalities in order to derive
sharper statements but everything is valid if we allow $\geq$ instead). The main result is as follows.

Theorem 1. With notation as above, the permutation $[a, d, e, f, c, b]$ giving rise to marginal probability vectors
$(a + f, c + d, b + e)$ and $(a + d + e, b + c + f)$ has maximal CMI among all 720 possible permutations of $\{a, b, c, d, e, f\}$.
This is the case irrespective of the sizes of the gaps between $a, b, c, d, e, f$.
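The statement of the theorem lends itself to a brute-force numerical sanity check. The sketch below draws a few random ordered spectra, evaluates the CMI of all 720 arrangements read row by row into a $2 \times 3$ matrix, and compares the best value with that of the arrangement $[a, d, e, f, c, b]$. It is an illustration only and not a substitute for the proof: the random seed, the number of trials and the helper names are all arbitrary choices, and natural logarithms are assumed.

```python
import itertools
import math
import random

def H(x):
    """Single-term entropy H(x) = -x log x, with H(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def cmi(entries, m=2, n=3):
    """CMI of the m x n probability matrix whose entries are listed row by row."""
    rows = [sum(entries[i * n:(i + 1) * n]) for i in range(m)]
    cols = [sum(entries[j::n]) for j in range(n)]
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(map(H, entries))

def random_spectrum(rng, size=6):
    """Random probability vector sorted in decreasing order."""
    xs = sorted((rng.random() for _ in range(size)), reverse=True)
    total = sum(xs)
    return [x / total for x in xs]

def theorem_arrangement(spectrum):
    """The arrangement [a, d, e, f, c, b] named in the theorem."""
    a, b, c, d, e, f = spectrum
    return [a, d, e, f, c, b]

rng = random.Random(0)  # fixed seed so the run is reproducible
worst_gap = 0.0
for _ in range(25):
    spectrum = random_spectrum(rng)
    best = max(cmi(list(p)) for p in itertools.permutations(spectrum))
    worst_gap = max(worst_gap, best - cmi(theorem_arrangement(spectrum)))

print("largest shortfall of the theorem's arrangement:", worst_gap)
```

A shortfall at the level of floating-point rounding is what one would expect if the arrangement is maximal for every sampled spectrum.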

In order to prove the theorem we need some preliminary ideas. 



A. Definitions: classical mutual information, majorisation and the entropic binary relation $\triangleright$



We set up the framework of the problem for general m, n. 



1. The classical mutual information (CMI) attached to an m x n probability matrix 

Suppose we are given a matrix $\rho_{AB}$ in the form (6) with a given splitting into a pair of subsystems $A$ of dimension
$m$ and $B$ of dimension $n$, so that we may arrange the eigenvalues in an $m \times n$ matrix as follows:

$$
P \;=\;
\begin{array}{c|cccc}
 & c_1 & c_2 & \cdots & c_n \\ \hline
r_1 & \lambda_1 & \lambda_2 & \cdots & \lambda_n \\
r_2 & \lambda_{n+1} & \lambda_{n+2} & \cdots & \lambda_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
r_m & \lambda_{(m-1)n+1} & \lambda_{(m-1)n+2} & \cdots & \lambda_{mn}
\end{array}
\tag{7}
$$

As shown we let the row sums be denoted by $r_i = \sum_{j=1}^{n} \lambda_{(i-1)n+j}$ for $i = 1, \ldots, m$ and similarly for the column sums:
$c_j = \sum_{i=1}^{m} \lambda_{(i-1)n+j}$ for $j = 1, \ldots, n$. Then by the definition of the partial trace map (equivalently, the contraction of
a tensor along a particular index) we see that the density matrices $\rho_A$ and $\rho_B$ referred to above are now the diagonal
matrices $\operatorname{diag}(r_1, \ldots, r_m)$ and $\operatorname{diag}(c_1, \ldots, c_n)$ respectively.

So $P$ has the form of a joint probability matrix where the marginal probabilities are given by the $r_i$ and the $c_j$. To
define the classical mutual information (see [9], §2.3) we take the sum of the entropies of the $r_i$ and the $c_j$ over all $i, j$
and then subtract the sum of the individual entropies of the $\lambda_k$, for $k = 1, \ldots, mn$. Formally:

Definition 1. With notation as above, the classical mutual information $I(P)$ of the matrix $P$ is given by

$$I(P) = \sum_{i=1}^{m} -r_i \log r_i + \sum_{j=1}^{n} -c_j \log c_j - \sum_{k=1}^{mn} -\lambda_k \log \lambda_k. \tag{8}$$

We will often write $H(x) = -x \log x$ for $x \in [0, 1]$ and so we may rewrite (8) as

$$I(P) = \sum_{i=1}^{m} H(r_i) + \sum_{j=1}^{n} H(c_j) - \sum_{k=1}^{mn} H(\lambda_k).$$
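To connect this definition with the informal description of mutual information as a distance between a joint distribution and the product of its marginals, the following sketch computes $I(P)$ from the row and column sums and compares it with the Kullback-Leibler divergence of $P$ from the product of its marginals, a standard identity of classical information theory. The example matrix and the function names are illustrative, and natural logarithms are assumed.

```python
import math

def H(x):
    """Single-term entropy H(x) = -x log x, with H(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def marginals(P):
    """Row sums r_i and column sums c_j of a probability matrix."""
    rows = [sum(row) for row in P]
    cols = [sum(col) for col in zip(*P)]
    return rows, cols

def cmi(P):
    """I(P) = sum H(r_i) + sum H(c_j) - sum H(lambda_k)."""
    rows, cols = marginals(P)
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(H(p) for row in P for p in row)

def kl_from_product(P):
    """Kullback-Leibler divergence of P from the product of its marginals."""
    rows, cols = marginals(P)
    return sum(
        p * math.log(p / (rows[i] * cols[j]))
        for i, row in enumerate(P)
        for j, p in enumerate(row)
        if p > 0
    )

P = [[0.30, 0.10, 0.05],
     [0.05, 0.20, 0.30]]  # illustrative 2 x 3 probability matrix

rows, cols = marginals(P)
product = [[r * c for c in cols] for r in rows]  # uncorrelated matrix, same marginals

print("I(P)            :", cmi(P))
print("KL(P || r x c)  :", kl_from_product(P))
print("I(product state):", cmi(product))
```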
2. Majorisation between two m × n probability matrices



For definitions and basic results connected with majorisation, see [2] and [2]. We shall use the standard symbol
$\succ$ to denote majorisation. For any $m \times n$ matrix $M$ denote by $r(M) \in \mathbb{R}^m$ the vector of marginal probabilities
represented by the sums of the rows of $M$ and similarly by $c(M) \in \mathbb{R}^n$ the vector of marginal probabilities created
from the sums of the columns of $M$.

Lemma 2. Let $M_1, M_2$ be two probability matrices. If $r(M_1) \succ r(M_2)$ and if $c(M_1) \succ c(M_2)$, then

$$I(M_1) \leq I(M_2).$$

Proof. See [12]: it follows from the fact that $H$ is a Schur-concave function (see [2], §11.3). □
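The lemma can be illustrated on a pair of $2 \times 3$ matrices built from the same six entries. In the sketch below, `majorises` is a hypothetical helper implementing the usual partial-sum test for majorisation; the spectrum is illustrative and natural logarithms are assumed.

```python
import math

def H(x):
    """Single-term entropy H(x) = -x log x, with H(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def cmi(M):
    """Classical mutual information of a probability matrix."""
    rows = [sum(row) for row in M]
    cols = [sum(col) for col in zip(*M)]
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(H(p) for row in M for p in row)

def majorises(x, y, tol=1e-12):
    """True if x majorises y: equal totals, and every partial sum of the
    decreasingly sorted x is at least the corresponding partial sum of y."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    if len(xs) != len(ys) or abs(sum(xs) - sum(ys)) > tol:
        return False
    partial_x = partial_y = 0.0
    for px, py in zip(xs, ys):
        partial_x += px
        partial_y += py
        if partial_x < partial_y - tol:
            return False
    return True

a, b, c, d, e, f = 0.30, 0.25, 0.20, 0.15, 0.07, 0.03  # illustrative spectrum
M1 = [[a, b, c], [d, e, f]]  # entries in decreasing order, row by row
M2 = [[a, d, e], [f, c, b]]  # a second arrangement of the same entries

def row_sums(M):
    return [sum(row) for row in M]

def col_sums(M):
    return [sum(col) for col in zip(*M)]

rows_ok = majorises(row_sums(M1), row_sums(M2))
cols_ok = majorises(col_sums(M1), col_sums(M2))
print("r(M1) majorises r(M2):", rows_ok)
print("c(M1) majorises c(M2):", cols_ok)
print("I(M1), I(M2):", cmi(M1), cmi(M2))
```

Both marginal vectors of `M1` majorise those of `M2` here, and the computed CMI values are ordered as the lemma predicts.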

It should be pointed out that the converse is definitely NOT true: indeed it is this very failure which enables us to 
prove the main theorem of this paper. 

Definition 2. If the hypotheses of Lemma 2 hold then we write

$$M_1 \succ M_2$$

and we shall say that $M_1$ majorises $M_2$: but note that this matrix terminology is not standard.

By symmetry the relation of majorisation between matrices is invariant under row swaps and/or column swaps. In 
addition if m = n then the majorisation relation is also invariant under transposition. 



3. An entropic binary relation $\triangleright$ among m × n probability matrices

The entropic binary relation $\triangleright$, which we now define, is the key to proving theorem 1. If we consider the class
of $(mn)!$ matrices formed by permuting the entries in the matrix $P$ in (7) and look at the CMI of each of these,
there is a rigid a priori partial order which arises between them [15]. Most of this can be explained by majorisation
considerations; however in low dimensions there is a substantial set of relations which depends on a much finer graining
than majorisation gives. This fine-graining is an entropic binary relation which is implied by the stronger relation of
majorisation: see proposition 8.

The general relation is defined as follows. Recall from above the definition of the classical mutual information $I(P)$
of a probability matrix $P$. For any positive integer $N$ we denote by $S_N$ the symmetric group on $N$ letters.

Definition 3. Let $P = (p_{ij})$ be any $m \times n$ probability matrix and let $Q$ be an $m \times n$ matrix obtained by some
permutation of the elements of $P$ (that is, viewed as vectors: $Q = P^{\sigma}$ for some $\sigma \in S_{mn}$). Suppose that a complete
ordering is given of the $p_{ij}$. We say that $P \triangleright Q$ if it can be shown a priori, solely using this ordering of the entries,
that $I(Q) - I(P)$ is non-negative.

In other words given any ordered probability vector $(p_{ij})$, if we arrange its elements into the orders displayed in $P$
and $Q$ then $I(Q) - I(P) \geq 0$.

NB: In order to keep the terminology consistent with that of majorisation, we have adopted the convention that
$P \triangleright Q$ corresponds to $I(P) \leq I(Q)$.

That is to say, given an a priori ordering of the elements of the matrix, such a relation $P \triangleright Q$ holds irrespective
of the relative sizes of these matrix entries.

Remark. We mentioned above the connection with the symmetric group $S_{mn}$. The partial order arising on the
matrices gives a partial order on the space of cosets of $S_{mn}$ modulo a subgroup representing row and column swaps
(see for example section III B 1), because the relations are guaranteed to hold for all $m \times n$ matrices depending as they
do only upon the particular arrangement of the $mn$ elements. This points to a deeper connection with combinatorial
group theory which we explore in [7].



In order to see what $\triangleright$ means in the case which will most interest us - that of a simple transposition - we consider a
general $m \times n$ probability matrix $P = (p_{ij})$ with no assumed order among the entries $p_{ij}$. Let $\tau$ be any transposition
acting on $P$, interchanging two elements which we shall refer to as $\alpha$ and $\beta$ (by a slight abuse of notation, since the
positions and their values will be referred to by the same symbols). The following diagram illustrates this action of $\tau$
on $P$: we write $P^{\tau}$ for the image of $P$ under $\tau$.
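$$
P = \begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots & \alpha & & \vdots \\
\vdots & & \beta & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mn}
\end{pmatrix}
\;\xrightarrow{\;\tau\;}\;
P^{\tau} = \begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots & \beta & & \vdots \\
\vdots & & \alpha & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mn}
\end{pmatrix}
\tag{9}
$$
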

Without loss of generality we may stipulate that as matrix entries $\alpha > \beta$ (if they are equal there is nothing to be
done). We wish to compare $I(P)$ with $I(P^{\tau})$. Note firstly that by the definition of CMI, the difference $I(P^{\tau}) - I(P)$
depends only on the rows and columns containing $\alpha, \beta$. All of the rest of the terms vanish as they are not affected by
the action of $\tau$. We denote by $r_\alpha$ (respectively $r_\beta$) the sum of the entries in the row of $P$ which contains $\alpha$ (respectively
$\beta$), and by $c_\alpha$ (respectively $c_\beta$) the sum of the entries in the column of $P$ which contains $\alpha$ (respectively $\beta$). Similarly,
we denote by $r^\tau_\alpha, r^\tau_\beta, c^\tau_\alpha, c^\tau_\beta$ the image of these quantities under the action of $\tau$. See the diagram (9) above.

NB: $r^\tau_\alpha, c^\tau_\alpha$ (respectively, $r^\tau_\beta, c^\tau_\beta$) no longer contain $\alpha$ (respectively $\beta$), but rather $\beta$ (respectively $\alpha$).

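So the quantity we are interested in becomes

$$
I(P^\tau) - I(P) = H(r^\tau_\alpha) - H(r_\alpha) + H(r^\tau_\beta) - H(r_\beta) + H(c^\tau_\alpha) - H(c_\alpha) + H(c^\tau_\beta) - H(c_\beta),
\tag{10}
$$

with the proviso that if $\alpha$ and $\beta$ happen to be in the same row (respectively column) then the $r$ (respectively, $c$)
terms vanish. The terms on the right hand side are grouped in pairs of the form $\pm\big(H(x + (\alpha - \beta)) - H(x)\big)$, which
means we may write it in a more suggestive form:

$$
I(P^\tau) - I(P) = (\alpha - \beta)\left[ -\frac{H(r_\alpha) - H(r^\tau_\alpha)}{\alpha - \beta} + \frac{H(r^\tau_\beta) - H(r_\beta)}{\alpha - \beta} - \frac{H(c_\alpha) - H(c^\tau_\alpha)}{\alpha - \beta} + \frac{H(c^\tau_\beta) - H(c_\beta)}{\alpha - \beta} \right].
\tag{11}
$$
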

In order to use calculus we need the machinery of Lagrangian means (see chapter VI §2.2 of [5]).



Definition 4. Let $\varphi$ be a continuously differentiable and strictly convex or strictly concave function defined on a real
interval $I$, with first derivative $\varphi'$. Define the Lagrangian mean $\mu_\varphi$ associated with $\varphi$ to be:

$$
\mu_\varphi(a,b) =
\begin{cases}
\varphi'^{-1}\!\left(\dfrac{\varphi(b) - \varphi(a)}{b - a}\right) & \text{if } b \neq a \\[2ex]
a & \text{if } b = a
\end{cases}
\tag{12}
$$

for any $a, b \in I$, where $\varphi'^{-1}$ denotes the unique (on $I$, by virtue of strict convexity/concavity and differentiability)
inverse of $\varphi'$.




In other words, $\mu_\varphi$ is the function which arises from the Lagrangian mean value theorem in the process of going
from the points $(a, \varphi(a))$ and $(b, \varphi(b))$ subtending a secant on the curve of $\varphi$, to the unique (in this case) point
$\mu_\varphi(a,b) \in [a,b]$ where the slope of the tangent to the curve $\varphi$ is equal to that of the secant. See the diagram below.
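For the entropy term $H(x) = -x \log x$, whose derivative is $H'(x) = -(1 + \log x)$, definition 4 can be evaluated in closed form because $H'^{-1}(s) = e^{-(1+s)}$. The sketch below (function names are illustrative assumptions) computes the Lagrangian mean of $H$ from the secant slope and compares it with the identric-mean formula $e^{-1}(y^y/x^x)^{1/(y-x)}$.

```python
import math


def H(x):
    """Entropy term H(x) = -x log x for x > 0."""
    return -x * math.log(x)


def H_prime(x):
    """Derivative H'(x) = -(1 + log x)."""
    return -(1.0 + math.log(x))


def lagrangian_mean_H(a, b):
    """Point in [a, b] where the tangent slope of H equals the secant slope."""
    if a == b:
        return a
    slope = (H(b) - H(a)) / (b - a)
    return math.exp(-(1.0 + slope))  # inverse of H' applied to the secant slope


def identric_mean(x, y):
    """Identric mean e^{-1} (y^y / x^x)^{1/(y-x)}, computed via logarithms."""
    if x == y:
        return x
    return math.exp((y * math.log(y) - x * math.log(x)) / (y - x) - 1.0)
```

Working in logarithms avoids forming $y^y / x^x$ directly, which is convenient when the arguments are small.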

Each of the arguments for the function $H$ in (11) lies in the interval $I = [0, 1]$. Since $H$ is well-defined and indeed
strictly concave and infinitely differentiable on $I$ we may rewrite (11) as:

$$
\begin{aligned}
I(P^\tau) - I(P) &= (\alpha - \beta)\Big( -H'\big(\mu_H(r^\tau_\alpha, r_\alpha)\big) + H'\big(\mu_H(r_\beta, r^\tau_\beta)\big) - H'\big(\mu_H(c^\tau_\alpha, c_\alpha)\big) + H'\big(\mu_H(c_\beta, c^\tau_\beta)\big) \Big) & (13) \\
&= (\alpha - \beta) \log \frac{\mu_H(r^\tau_\alpha, r_\alpha)\, \mu_H(c^\tau_\alpha, c_\alpha)}{\mu_H(r_\beta, r^\tau_\beta)\, \mu_H(c_\beta, c^\tau_\beta)}, & (14)
\end{aligned}
$$

the second line following from the fact that in our context $\varphi'(x) = H'(x) = -(1 + \log(x))$. Since $(\alpha - \beta) > 0$ by
hypothesis, in order to determine which matrix gives higher CMI we only need consider the relative sizes of the
numerator and denominator of the argument of the logarithm. So it is enough to study the quantity

$$
\mu_H(r^\tau_\alpha, r_\alpha)\, \mu_H(c^\tau_\alpha, c_\alpha) - \mu_H(r_\beta, r^\tau_\beta)\, \mu_H(c_\beta, c^\tau_\beta).
\tag{15}
$$
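The passage from (10) to (14) can be checked numerically. The sketch below is an illustration under stated assumptions: positions are zero-based `(row, column)` pairs, `pos_alpha` holds the larger entry, all entries are strictly positive, and the helper names are hypothetical. It compares the direct difference $I(P^\tau) - I(P)$ with $(\alpha - \beta)$ times the logarithm of the ratio of identric means, dropping the row (respectively column) factors when $\alpha$ and $\beta$ share a row (respectively column).

```python
import math


def H(x):
    """Entropy term H(x) = -x log x, with H(0) = 0."""
    return 0.0 if x <= 0.0 else -x * math.log(x)


def cmi(P):
    """Classical mutual information of a probability matrix (list of rows)."""
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(H(p) for r in P for p in r)


def identric_mean(x, y):
    """Lagrangian mean of H, i.e. the identric mean of x and y."""
    if x == y:
        return x
    return math.exp((y * math.log(y) - x * math.log(x)) / (y - x) - 1.0)


def delta_direct(P, pos_alpha, pos_beta):
    """I(P^tau) - I(P), computed by actually swapping the two entries."""
    (i, j), (k, l) = pos_alpha, pos_beta
    Q = [row[:] for row in P]
    Q[i][j], Q[k][l] = Q[k][l], Q[i][j]
    return cmi(Q) - cmi(P)


def delta_closed_form(P, pos_alpha, pos_beta):
    """(alpha - beta) * log of the ratio of identric means, as in (14)."""
    (i, j), (k, l) = pos_alpha, pos_beta
    t = P[i][j] - P[k][l]  # t = alpha - beta, assumed positive
    ratio = 1.0
    if i != k:  # row terms vanish when alpha and beta share a row
        r_alpha, r_beta = sum(P[i]), sum(P[k])
        ratio *= identric_mean(r_alpha - t, r_alpha) / identric_mean(r_beta, r_beta + t)
    if j != l:  # column terms vanish when they share a column
        c_alpha = sum(row[j] for row in P)
        c_beta = sum(row[l] for row in P)
        ratio *= identric_mean(c_alpha - t, c_alpha) / identric_mean(c_beta, c_beta + t)
    return t * math.log(ratio)
```

The two functions should agree up to rounding for any positive probability matrix and any pair of positions whose first entry is the larger one.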



We are now in a position to re-state what is meant by the entropic binary relation $\triangleright$ for this special case of a
transposition.

FIG. 7. Definition of $\mu_\varphi$.

Lemma 3. With notation as above, $P \triangleright P^\tau$ if and only if it can be shown a priori that the quantity in (15) is
non-negative. □

To study the function $\mu_H$ in more detail we shall need the following lemmas.

Lemma 4. Let $u \leq v \leq w \leq z$ be any four positive numbers satisfying $v + w \geq u + z$. Then $vw \geq uz$.

Proof. Let $\varepsilon = v - u \geq 0$ so that $v = u + \varepsilon$ and $z \leq w + \varepsilon$. Then $vw = uw + \varepsilon w \geq uw + \varepsilon u = u(w + \varepsilon) \geq uz$. □

Lemma 5. Let $\psi$ be a concave monotonically increasing function of the non-negative real numbers taking positive
values. Let $p \leq q \leq r \leq s$ be positive real numbers satisfying $q + r \geq p + s$. Then $\psi(q) + \psi(r) \geq \psi(p) + \psi(s)$ and
consequently:

$$
\psi(q) \cdot \psi(r) \geq \psi(p) \cdot \psi(s). \tag{16}
$$



Remark. The condition $q + r \geq p + s$ is sufficient to prove (16) but it is not necessary, as the example $\psi$ = identity
and $p = 0.1$, $q = 0.4$, $r = 0.4$, $s = 0.9$ shows.

Proof. We first show that

$$
\psi(q) + \psi(r) \geq \psi(p) + \psi(s).
$$

Suppose to the contrary that $\psi(p) + \psi(s) > \psi(q) + \psi(r)$: on rearranging we obtain

$$
\psi(s) - \psi(r) > \psi(q) - \psi(p).
$$

By a similar rearrangement of the hypothesis of the lemma we know that $s - r \leq q - p$ and so combining these and
using the fact that all terms are positive:

$$
\frac{\psi(s) - \psi(r)}{s - r} > \frac{\psi(q) - \psi(p)}{q - p}.
$$

By the mean value theorem there exist $\gamma \in [r, s]$ and $\delta \in [p, q]$ such that $\psi'(\gamma) > \psi'(\delta)$. Since $\psi$ is concave it follows
that $\psi'$ is monotonically non-increasing, so this implies in turn that $\gamma < \delta$. But the intervals $[p, q]$ and $[r, s]$ are disjoint
with the first entirely less than the second, hence $\gamma \geq \delta$, which is the desired contradiction.

Now $0 < \psi(p) \leq \psi(q) \leq \psi(r) \leq \psi(s)$ by the hypothesis that $\psi$ is monotonically increasing, so inequality (16) follows
from lemma 4 on setting $u = \psi(p)$, $v = \psi(q)$, $w = \psi(r)$, $z = \psi(s)$. □
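Lemma 5 lends itself to a quick randomized sanity check. The sketch below assumes $\psi = \sqrt{\cdot}$ as one convenient concave, increasing, positive test function (any other such function could be passed instead), samples ordered quadruples, skips those that fail the hypothesis $q + r \geq p + s$, and reports whether both conclusions held. The function name and sampling range are illustrative choices.

```python
import math
import random


def check_lemma5(psi, trials=10000, seed=1):
    """Test both conclusions of lemma 5 on random quadruples p <= q <= r <= s.

    Returns (all_passed, number_of_quadruples_meeting_the_hypothesis).
    """
    rng = random.Random(seed)
    checked = 0
    for _ in range(trials):
        p, q, r, s = sorted(rng.uniform(0.01, 1.0) for _ in range(4))
        if q + r < p + s:
            continue  # hypothesis of the lemma not satisfied
        checked += 1
        if psi(q) + psi(r) < psi(p) + psi(s) - 1e-12:
            return False, checked
        if psi(q) * psi(r) < psi(p) * psi(s) - 1e-12:
            return False, checked
    return True, checked
```

The example from the remark (identity function, $p = 0.1$, $q = r = 0.4$, $s = 0.9$) violates the hypothesis yet still satisfies (16), since $0.16 \geq 0.09$.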



We now prove some facts about $\mu_H$ which will give us an insight into the sign of the quantity in (15).

Lemma 6. Fix $t \in (0, 1)$. For $x \in (0, 1 - t)$:

(i) $\mu_H(x, x + t) > 0$ and is strictly monotonically increasing in $x$;

(ii) $\mu_H(x, x + t)$ is strictly concave in $x$;

(iii) $\frac{1}{e} < \frac{1}{t}\big(\mu_H(x, x + t) - x\big) < \frac{1}{2}$.

Note that (iii) says that the Lagrangian mean of $x$ and $y$ occurs between $x + \frac{t}{e}$ and $x + \frac{t}{2}$. Both extremes
occur in the limit, so a priori we cannot narrow the range down further than this.



Proof. Solving (12) explicitly for $\varphi = H$ we see that $\mu_H$ is in fact what is known as the identric mean of $x$ and $y$:

$$
\mu_H(x, y) = e^{-1}\left(\frac{y^y}{x^x}\right)^{\frac{1}{y - x}},
$$

or if we set $t = y - x$:

$$
\mu_H(x, x + t) = e^{-1}\left(\frac{(x + t)^{x + t}}{x^x}\right)^{\frac{1}{t}} = e^{-1}(x + t)\left(1 + \frac{t}{x}\right)^{\frac{x}{t}}.
$$

From this, the fact that $\mu_H(x, x + t) > 0$ for $x, t > 0$ may be seen directly. Taking the first derivative with respect to
$x$ gives

$$
\frac{\partial}{\partial x}\big(\mu_H(x, x + t)\big) = e^{-1}\left(1 + \frac{x}{t}\right)\left(1 + \frac{t}{x}\right)^{\frac{x}{t}} \log\left(1 + \frac{t}{x}\right),
$$

which once again is positive for $x, t > 0$, proving that indeed $\mu_H(x, x + t)$ is strictly monotonically increasing in $x$ for
fixed $t$. This proves (i).

To prove (ii) we take the second derivative with respect to $x$ (writing $\log^2(X)$ for $(\log(X))^2$):

$$
\frac{\partial^2}{\partial x^2}\big(\mu_H(x, x + t)\big) = e^{-1}\left(1 + \frac{t}{x}\right)^{\frac{x}{t}}\left[\frac{t + x}{t^2}\log^2\left(1 + \frac{t}{x}\right) - \frac{1}{x}\right].
$$

We need to establish that this is always negative: this will be the case if and only if the right-hand factor, in square
brackets, is negative. So we must show that for $x, t > 0$:

$$
\frac{1}{x} > \frac{t + x}{t^2}\log^2\left(1 + \frac{t}{x}\right),
$$

which for $x > 0$ is the same as showing that

$$
\frac{x(t + x)}{t^2}\log^2\left(1 + \frac{t}{x}\right) < 1.
$$

Since $t > 0$ is assumed fixed we may define a new variable $\xi = 1 + t/x$ and rewrite the left hand side as

$$
\frac{\xi \log^2 \xi}{(\xi - 1)^2},
$$

which is $< 1$ if and only if its (positive) square root is $< 1$, since $\xi$ takes values only between $(1 + t)$ and $\infty$. So we
need to show that for $\xi > 1$,

$$
\frac{\sqrt{\xi}\,\log \xi}{\xi - 1} < 1,
$$

or equivalently,

$$
\xi - \sqrt{\xi}\,\log \xi - 1 > 0.
$$

The limit of the left hand side of this expression as $\xi$ tends to 1 from above is 0. So it is enough to show that for any
$\xi > 1$, the derivative

$$
1 - \frac{1 + \log\sqrt{\xi}}{\sqrt{\xi}}
$$

of this left hand side is positive, which by a change of variable to $z = \sqrt{\xi}$ (which preserves our domain $\xi > 1$) is
simply the statement

$$
1 + \log z < z,
$$

which is a standard fact about logarithms (see for example [TU] §5.2.4). 

We only require (i) and (ii) for the proof of theorem 1, so since (iii) follows by similar techniques we omit the proof.

□ 
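The three parts of lemma 6 can be probed on a grid using the closed form $\mu_H(x, x+t) = e^{-1}(x+t)(1 + t/x)^{x/t}$. The following sketch is only a finite-difference illustration (the grid size and function names are assumptions); it checks monotonicity, negative second differences and the bounds in (iii) at equally spaced points of $(0, 1 - t)$.

```python
import math


def mu_H(x, t):
    """mu_H(x, x + t) = e^{-1} (x + t) (1 + t/x)^{x/t} for x, t > 0."""
    return math.exp(-1.0) * (x + t) * (1.0 + t / x) ** (x / t)


def check_lemma6(t, steps=50):
    """Return (increasing, concave, bounded) on an equally spaced grid in (0, 1 - t)."""
    h = (1.0 - t) / (steps + 1)
    xs = [h * k for k in range(1, steps + 1)]
    vals = [mu_H(x, t) for x in xs]
    increasing = all(b > a for a, b in zip(vals, vals[1:]))
    # Negative second differences on an equally spaced grid indicate concavity.
    concave = all(
        vals[k - 1] - 2.0 * vals[k] + vals[k + 1] < 0.0
        for k in range(1, len(vals) - 1)
    )
    bounded = all(1.0 / math.e < (v - x) / t < 0.5 for x, v in zip(xs, vals))
    return increasing, concave, bounded
```

A grid check of this kind cannot replace the calculus argument, but it is a cheap guard against sign errors when transcribing the derivatives.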

We shall need the following sufficient condition for the entropic relation. Consider the four terms which constitute
the first argument in each of the instances of the function $\mu_H$ in (15), namely

$$
r^\tau_\alpha,\; c^\tau_\alpha,\; r_\beta,\; c_\beta. \tag{17}
$$

Observe that there are no a priori relationships between the sizes of these quantities. Let us consider the possible 
orderings of the four terms based upon what we know of the ordering of the matrix elements of P. In principle there 
are 24 such possibilities; however in certain instances of small dimension such as our 2x3 case, most of these may be 
eliminated and we are left with only a few orderings. 

In looking at (9) for the special case where $m = 2$ (and $n$ is any integer) one sees that the relations in the following
diagram must always hold (i.e. irrespective of the values of the probabilities), where a downward arrow between values
$x$ and $y$ indicates that $x > y$ a priori.

FIG. 8. All fixed relations between the quantities $r_\alpha, r_\beta, c_\alpha, c_\beta, r^\tau_\alpha, r^\tau_\beta, c^\tau_\alpha, c^\tau_\beta$ in the $2 \times n$ case.



Proposition 7. Suppose that the a priori minimum element in (17) is either $r_\beta$ or $c_\beta$. In addition suppose that we
can verify a priori that $r_\beta + c_\beta \leq r^\tau_\alpha + c^\tau_\alpha$. Then $P \triangleright P^\tau$.

Conversely, suppose that the a priori minimum element in (17) is either $r^\tau_\alpha$ or $c^\tau_\alpha$ and in addition suppose that we
can verify a priori that $r_\beta + c_\beta \geq r^\tau_\alpha + c^\tau_\alpha$. Then $P \triangleleft P^\tau$.

Proof. We prove the first assertion; the second follows by symmetry.

Without loss of generality (since we could at this stage equally consider the transposed matrices) we may assume
that the minimum element in (17) is $c_\beta$. Now if $r_\beta$ is not the maximum element in (17) then one of $r^\tau_\alpha, c^\tau_\alpha$ is larger than
$r_\beta$, hence both $r_\beta$ and $c_\beta$ are dominated by at least one or both of $r^\tau_\alpha, c^\tau_\alpha$, and so by the monotonicity of $\mu_H(x, x + t)$
for fixed $t$ (part (i) of lemma 6) the expression in (15) must be non-negative, meaning $P \triangleright P^\tau$ as required. So suppose
to the contrary that $r_\beta$ is the maximum element in (17), meaning that the ordering of the elements is either

$$
c_\beta \leq r^\tau_\alpha \leq c^\tau_\alpha \leq r_\beta
$$

or

$$
c_\beta \leq c^\tau_\alpha \leq r^\tau_\alpha \leq r_\beta.
$$

By rewriting (15) in a more explicit form and writing $t$ for $\alpha - \beta$, we need to show that

$$
\mu_H(r^\tau_\alpha, r^\tau_\alpha + t)\, \mu_H(c^\tau_\alpha, c^\tau_\alpha + t) - \mu_H(r_\beta, r_\beta + t)\, \mu_H(c_\beta, c_\beta + t) \geq 0.
\tag{18}
$$

But using parts (i) and (ii) of lemma 6 we see that viewed simply as a function of $x$ for fixed values of $t$, $\mu_H(x, x + t)$
satisfies the hypotheses of lemma 5. So finally we let $\psi(x) = \mu_H(x, x + t)$ for fixed $t = \alpha - \beta$, let $p = c_\beta$, $s = r_\beta$ and
set $\{q, r\} = \{r^\tau_\alpha, c^\tau_\alpha\}$ in the appropriate ordering. Then using the hypothesis of the proposition that $r_\beta + c_\beta \leq r^\tau_\alpha + c^\tau_\alpha$,
we obtain the result (18) from lemma 5. □
To tie $\triangleright$ back to majorisation we have the following result.

Proposition 8. Let $P$, $P^\tau$ be as above. Then

$$
(P \succ P^\tau) \implies (P \triangleright P^\tau). \tag{19}
$$

Furthermore if $\alpha$ and $\beta$ belong to the same row or column then the two notions of majorisation and $\triangleright$ are the same.

Proof. In essence this is just lemma 2 and definition 2; however because the techniques are needed below, we prove it
explicitly.

By a standard result on majorisation (see Corollary II.1.4 on p. 31 of [2]), since all terms arising from rows or columns
not containing $\alpha$ or $\beta$ are identical for both matrices, $P \succ P^\tau$ may be simplified to the statement that (as vectors):

$$
(r_\alpha, r_\beta) \succ (r^\tau_\alpha, r^\tau_\beta) \quad \text{and} \quad (c_\alpha, c_\beta) \succ (c^\tau_\alpha, c^\tau_\beta).
$$

Now $r_\alpha > r^\tau_\alpha$ by definition, and since conversely $r_\beta < r^\tau_\beta$, it follows that the elements of each set may be ordered as
follows:

$$
r_\alpha \geq \max\{r^\tau_\alpha, r^\tau_\beta\} \geq \min\{r^\tau_\alpha, r^\tau_\beta\} \geq r_\beta, \quad \text{and} \quad c_\alpha \geq \max\{c^\tau_\alpha, c^\tau_\beta\} \geq \min\{c^\tau_\alpha, c^\tau_\beta\} \geq c_\beta. \tag{20}
$$

In the simple case where $\alpha$ and $\beta$ belong to the same row, we may set the row terms in (15) to 1, and it becomes
apparent that $P \triangleright P^\tau$ is exactly the statement that $\mu_H(c^\tau_\alpha, c_\alpha) \geq \mu_H(c_\beta, c^\tau_\beta)$, which by part (i) of lemma 6 is the same
as saying that $c^\tau_\alpha \geq c_\beta$ (since the quantity $t = y - x = \alpha - \beta$ is the same for both sides). But $c^\tau_\alpha + c^\tau_\beta = c_\alpha + c_\beta$ and
so we must have that $c_\alpha \geq c^\tau_\beta$. Since we already know (as $\alpha > \beta$) that $c_\alpha > c^\tau_\alpha$ it follows that

$$
(c_\alpha, c_\beta) \succ (c^\tau_\alpha, c^\tau_\beta),
$$

that is to say $P \succ P^\tau$. Conversely if $P \succ P^\tau$ then plugging $c^\tau_\alpha \geq c_\beta$ into (15) - again ignoring the row terms - implies
$P \triangleright P^\tau$. So in this case it is clear that $P \succ P^\tau$ is the same thing as $P \triangleright P^\tau$. An identical argument works for the
other simple case where $\alpha$ and $\beta$ belong to the same column.

Finally let us consider the case where $\alpha$ and $\beta$ are in different rows and columns. Suppose that $P \succ P^\tau$; we
must show that $P \triangleright P^\tau$. But a look at the relationships between row and column sums in (20) shows that this is a
straightforward application of proposition 7. □

B. Proof of the main theorem 

So far we have constructed an abstract framework for the study of our entropic binary relation $\triangleright$; moreover we
have shown that it is a necessary condition for majorisation. For the rest of the paper we specialise to the case of
theorem 1, namely where $m = 2$ and $n = 3$ and as always $a > b > c > d > e > f > 0$. In the terminology of
definition 3 we need to view the permutation in the statement of theorem 1 as a matrix, which we shall call

$$
X = \begin{pmatrix} a & d & e \\ f & c & b \end{pmatrix}. \tag{21}
$$

1. The canonical matrix class representatives $\mathcal{R}_{2\times 3}$

Recall our original aim: given an ordering of these numbers $\{a, b, c, d, e, f\}$ we wished to establish whether there was
an a priori permutation which would give us the minimal and/or the maximal possible mutual information. There
are $6! = 720$ possible permutations of these elements, giving a set of matrices which we shall refer to throughout
as $\mathcal{M}_{2\times 3}$. However since simple row and column swaps do not change the CMI, and since there are $12 = |S_3| \cdot |S_2|$
such swaps, we are reduced to only $60 = 720/12$ different possible values for the CMI (provided that $\{a, b, c, d, e, f\}$
are all distinct: clearly repeated values within the elements will give rise to fewer possible CMI values).
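The count of 60 classes can be reproduced by brute force. The sketch below is an illustration under stated assumptions: matrices are encoded as row-major 6-tuples of the letters a to f, the "canonical" representative is simply the lexicographically smallest of the 12 row/column rearrangements (a hypothetical choice, not the representatives used in the text), and the sample probabilities $(32, 16, 8, 4, 2, 1)/63$ are picked only because they are distinct and decreasing. It also locates the class with the largest CMI for that one sample.

```python
import math
from itertools import permutations


def H(x):
    """Entropy term H(x) = -x log x, with H(0) = 0."""
    return 0.0 if x <= 0.0 else -x * math.log(x)


def cmi(P):
    """Classical mutual information of a probability matrix (list of rows)."""
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(H(p) for r in P for p in r)


def canonical(entries):
    """Smallest row-major tuple among the 12 row/column rearrangements of a 2x3 matrix."""
    return min(
        tuple(entries[3 * r + c] for r in rows for c in cols)
        for rows in permutations(range(2))
        for cols in permutations(range(3))
    )


# One representative per orbit under row and column swaps.
classes = {canonical(p) for p in permutations("abcdef")}

# Illustrative sample with a > b > c > d > e > f (an assumption, not from the text).
value = dict(zip("abcdef", (w / 63 for w in (32, 16, 8, 4, 2, 1))))


def cmi_of(entries):
    """CMI of the 2x3 matrix whose row-major letters are given."""
    nums = [value[ch] for ch in entries]
    return cmi([nums[:3], nums[3:]])


best = max(classes, key=cmi_of)
```

For this sample the maximising class is the one containing the matrix with rows (a, d, e) and (f, c, b), in line with the statement of the main theorem; a single sample is of course no substitute for the a priori argument.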

For convenience we shall standardize the form of a set of representatives of these 60 CMI-invariant classes of matrices.
This set of chosen representatives will be referred to as $\mathcal{R}_{2\times 3}$. Since we may always make $a$ the top left-hand entry of
any of the matrices in $\mathcal{M}_{2\times 3}$ by row and/or column swaps, we set a basic form for our matrices as $M = \begin{pmatrix} a & x & y \\ u & v & w \end{pmatrix}$,
where (as sets) $\{x, y, u, v, w\} = \{b, c, d, e, f\}$. This leaves us with only $5! = 120$ possibilities which we further divide
in half by requiring that $x > y$. So our final form for representative matrices will be:

$$
M = \begin{pmatrix} a & x & y \\ u & v & w \end{pmatrix}, \quad \text{with } x > y \text{ and } a = \max\{a, x, y, u, v, w\}. \tag{22}
$$

This yields our promised 60 representatives $\mathcal{R}_{2\times 3}$ in the form (22) for the 60 possible CMI values associated with
the fixed set of probabilities $\{a, b, c, d, e, f\}$. We now need to further subdivide $\mathcal{R}_{2\times 3}$ as follows. Matrices whose
rows and columns are arranged in descending order will be said to be in standard form. It is straightforward to see
that only five of the 60 matrices we have just constructed have this form, namely

$$
\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix},\;
\begin{pmatrix} a & b & d \\ c & e & f \end{pmatrix},\;
\begin{pmatrix} a & b & e \\ c & d & f \end{pmatrix},\;
\begin{pmatrix} a & c & d \\ b & e & f \end{pmatrix},\;
\text{and}\;
\begin{pmatrix} a & c & e \\ b & d & f \end{pmatrix}. \tag{23}
$$



Notice that all of these are in the form (22) with the additional condition that $u > v > w$. If we allow the bottom
row of any of these to be permuted we obtain $5 = |S_3| - 1$ new matrices which are not in standard form. In all this
gives a total of 30 matrices split into five groups of 6, indexed by each matrix in (23).

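Now consider matrices in $\mathcal{R}_{2\times 3}$ which cannot be in standard form by virtue of having top row entries which are
"too small" but nevertheless which still have the rows in descending order, viz:

$$
\begin{pmatrix} a & b & f \\ c & d & e \end{pmatrix},\;
\begin{pmatrix} a & c & f \\ b & d & e \end{pmatrix},\;
\begin{pmatrix} a & d & e \\ b & c & f \end{pmatrix},\;
\begin{pmatrix} a & d & f \\ b & c & e \end{pmatrix},\;
\text{and}\;
\begin{pmatrix} a & e & f \\ b & c & d \end{pmatrix}. \tag{24}
$$

Once again, by permuting the bottom row of each we obtain five new matrices: a second total of 30 matrices split
into five groups of 6, indexed by each matrix in (24).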
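The two families of five matrices with descending rows can be enumerated mechanically. In the sketch below, which is only an illustration, letters stand for the probabilities with the convention that an earlier letter denotes a larger value, so character comparison `t < b` means "the top entry is larger than the bottom entry". The list names are assumptions.

```python
from itertools import combinations

letters = "abcdef"  # a > b > c > d > e > f, so alphabetical order is descending value

standard, non_standard = [], []
for x, y in combinations("bcdef", 2):          # top row (a, x, y) with x > y
    top = ("a", x, y)
    bottom = tuple(ch for ch in letters if ch not in top)  # descending bottom row
    # Columns are descending when each top letter precedes the letter below it.
    columns_descending = all(t < b for t, b in zip(top, bottom))
    label = "".join(top) + "/" + "".join(bottom)
    (standard if columns_descending else non_standard).append(label)
```

Each label lists the top row, a slash, then the bottom row; the ten matrices split evenly into those in standard form and those whose top row is "too small".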

To visualize these subsets of matrices see the diagram on page 20 (together with the classification of $\mathcal{R}_{2\times 3}$ set out
in appendix A which is the key to their enumeration).

In order to facilitate the proof of theorem 1, here are a few results which help us to classify the relations between
the $\mathcal{R}_{2\times 3}$ classes. Call two matrices $M = \begin{pmatrix} a & p & q \\ r & s & t \end{pmatrix}$, $N = \begin{pmatrix} a & x & y \\ u & v & w \end{pmatrix}$ lexicographically ordered if the pair of row
vectors $(a, p, q, r, s, t)$ and $(a, x, y, u, v, w)$ is so ordered (i.e. the word "apqrst" would precede the word "axyuvw" in an
English dictionary).

Lemma 9. We may order the matrices in $\mathcal{R}_{2\times 3}$ lexicographically, and majorisation respects that ordering.

That is to say, if $M$ lies above $N$ lexicographically then $N$ cannot majorise $M$. Note that this is not the case for
the relation $\triangleright$.

Proof. The existence of such an ordering is obvious; so let $M \succ N$ where $M, N \in \mathcal{R}_{2\times 3}$. We need to show that $M$
precedes $N$ in the lexicographical ordering. Looking first at the top row of each matrix: since they both contain $a$,
and since a sum containing $a$ can only be a priori majorised by another sum containing $a$, it follows that the top row
sum of $M$ must be $\geq$ the top row sum of $N$ a priori. But both top rows are ordered lexicographically. So the top
row of $M$ either precedes that of $N$, in which case we are done; or else the top rows are in fact equal and so we must
look at the columns. But this is just the argument of lemma 10, see below. □



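Lemma 10. Fix any matrix $M = \begin{pmatrix} a & x & y \\ u & v & w \end{pmatrix} \in \mathcal{R}_{2\times 3}$ with the additional requirement that $u > v > w$. Permuting the
elements of the bottom row under the action of the symmetric group $S_3$ we have the following majorisation relations:

$$
\begin{pmatrix} a & x & y \\ u & v & w \end{pmatrix}
\succ
\begin{array}{c}
\begin{pmatrix} a & x & y \\ u & w & v \end{pmatrix} \succ \begin{pmatrix} a & x & y \\ v & w & u \end{pmatrix} \\[2ex]
\begin{pmatrix} a & x & y \\ v & u & w \end{pmatrix} \succ \begin{pmatrix} a & x & y \\ w & u & v \end{pmatrix}
\end{array}
\succ
\begin{pmatrix} a & x & y \\ w & v & u \end{pmatrix}.
\tag{25}
$$

There are in general no a priori majorisation relations within the two vertical pairs.

Proof. Noting that only two of the column sums are changed at each step, apply definition 2 bearing in mind the
assumptions that $u > v > w$ and $a > x > y$. □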
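Since permuting the bottom row leaves the row sums unchanged, the relations of lemma 10 reduce to majorisation between column-sum vectors. The sketch below tests six such comparisons on random data satisfying only $a > x > y$ and $u > v > w$; the function names, the tolerance and the particular list of pairs (outermost matrices against the middle ones, and along each horizontal line) are assumptions for illustration.

```python
import random


def majorises(u, v, tol=1e-12):
    """True if u majorises v: equal totals and dominating sorted partial sums."""
    us, vs = sorted(u, reverse=True), sorted(v, reverse=True)
    if abs(sum(us) - sum(vs)) > 1e-9:
        return False
    partial_u = partial_v = 0.0
    for a, b in zip(us, vs):
        partial_u += a
        partial_v += b
        if partial_u < partial_v - tol:
            return False
    return True


# Pairs (larger, smaller) of bottom rows, written as arrangements of u, v, w.
CHAIN = [("uvw", "uwv"), ("uvw", "vuw"), ("uwv", "vwu"),
         ("vuw", "wuv"), ("vwu", "wvu"), ("wuv", "wvu")]


def check_lemma10(trials=2000, seed=2):
    """Check the column-sum majorisation relations on random ordered data."""
    rng = random.Random(seed)
    for _ in range(trials):
        top = sorted((rng.random() for _ in range(3)), reverse=True)      # a > x > y
        u, v, w = sorted((rng.random() for _ in range(3)), reverse=True)  # u > v > w
        val = {"u": u, "v": v, "w": w}
        for hi, lo in CHAIN:
            cols_hi = [t + val[ch] for t, ch in zip(top, hi)]
            cols_lo = [t + val[ch] for t, ch in zip(top, lo)]
            if not majorises(cols_hi, cols_lo):
                return False
    return True
```

Each listed pair differs in exactly two column sums, and in every case the larger of the four sums involved belongs to the majorising arrangement, which is why the check passes regardless of the sampled values.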
Now the rightmost matrix in (25) corresponds to left multiplication by the permutation $w = (m_{21}, m_{23})$ of the
leftmost matrix: $\begin{pmatrix} a & x & y \\ w & v & u \end{pmatrix} = w \begin{pmatrix} a & x & y \\ u & v & w \end{pmatrix}$. By proposition 8 the fact that $A$ majorises $B$ implies that
$I(A) \leq I(B)$, so the minimal value for the CMI among the representative matrices in $\mathcal{R}_{2\times 3}$ must occur in a matrix of
the form on the left-hand side of (25); conversely the maximum must occur in a matrix of the form on the right-hand
side of (25).

Corollary 11. There is some $M$ in (23) such that the minimal value for the CMI of any matrix from the set $\mathcal{R}_{2\times 3}$
is given by $I(M)$.

There is some $A$ in (24) such that the maximal value for the CMI of any matrix from the set $\mathcal{R}_{2\times 3}$ is given
by $I(w(A))$.

Proof. Recall from lemma 2 the relationship between matrices, majorisation and the CMI. For the minima, one can
show directly from the definitions that every matrix in (24) is majorised by some matrix in (23), and we then use
lemma 10.

For the maxima we again use lemma 10 to reduce the problem to comparing the matrices obtained by applying $w$
to (23), with those obtained by applying $w$ to (24): then it can be shown once again from the definitions that every
matrix in (23) majorises at least one matrix in (24). Indeed it may be shown directly that each matrix in (23)
majorises the matrix $X$ in (21). □



Aside: the basic majorisation structure in pictures 



Using the simple majorisation relations developed in the foregoing discussion we have established a kind of "honeycomb"
which is the backbone of the partial order which is elaborated upon in [15]. Figure 9 shows the basic hexagonal
frames corresponding to the majorisation orderings in (25). The honeycomb consists of 10 hexagons each containing 6
matrices (one row of 5 slightly below the other reflecting the "standard" classification), with each matrix linked via
a hexagonal pattern to the other matrices in its own group. Each hexagonal cell is in itself a diagram of the Bruhat
order on $S_3$. The 2 sets of 5 hexagons come from lemma 10 and the 12 lines of 5 matrices each (consisting of aligned
vertices of the hexagons in their respective groupings) arise from variants of (23) and (24). The red numbers represent
the "major" element in each hexagon and are in fact all of the matrices in (23) for the top row, and (24) for the
bottom row. Note that we have placed the maximal CMI element 48 at the very bottom point, reflecting the fact
that it is below every other matrix in the partial order induced by $\triangleright$. The minimal CMI will occur for a matrix on
the very top row (matrices 1, 7, 13, 25 or 31).

The numbering is as per appendix A, i.e. the lexicographic ordering. We have stuck to this ordering as much as
possible in the diagram itself, trying to increase numbers within the hexagons as we move down and from left to 
right; however in places we have changed it slightly so that the patterns are rendered more clearly. The black arrows 
represent the majorisation relations in lemma [10] which arise within each hexagon. 

The light blue double-headed arrows represent the action of the inner automorphism £ u arising from the unique 
element $w = (1, 6)(2, 5)(3, 4) \in \mathrm{Sym}_6$ of maximal length (viewing $\mathrm{Sym}_6$ as a Coxeter group - see [7] for definitions)
which flips 22 pairs of matrix classes and fixes the remaining 16. Since this automorphism respects the relation $\triangleright$ on
$\mathcal{R}_{2\times 3}$ it follows that any entropic relations (including of course majorisation) involving the nodes which have a blue
arrow pointing to them will occur in pairs, thus considerably simplifying the structure. We explore this in [15].



2. Completion of the proof of theorem 1



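We are reduced by corollary 11 to showing that the CMI of the matrix $X$ is greater than or equal to that of the
other 4 matrices obtained by applying $w$ to the remainder of (24). We state these matrices for convenience:

$$
Y_1 = \begin{pmatrix} a & b & f \\ e & d & c \end{pmatrix},\quad
Y_2 = \begin{pmatrix} a & c & f \\ e & d & b \end{pmatrix},\quad
Y_3 = \begin{pmatrix} a & d & f \\ e & c & b \end{pmatrix},\quad
Y_4 = \begin{pmatrix} a & e & f \\ d & c & b \end{pmatrix}.
\tag{26}
$$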

We should remark first that there is no a priori majorisation relationship between any of these matrices $X, Y_1, Y_2, Y_3, Y_4$.
So we need a weaker (easier to satisfy) condition which distinguishes between them, namely the entropic relation $\triangleright$.
Since $P \triangleright Q$ implies that the CMI of $P$ is lower than that of $Q$, it is a transitive relation on CMI of matrices and
so it will suffice to show the following relations:

(i) $Y_1 \triangleright Y_2$
(ii) $Y_2 \triangleright X$
(iii) $Y_4 \triangleright Y_3$
(iv) $Y_3 \triangleright X$

FIG. 9. Representation of the most basic horizontal-transposition-based majorisation relations on $\mathcal{R}_{2\times 3}$ together with the
action of the involution £ (the blue arrows)

For all of these we just apply proposition 7 as follows.

(i) Let $P = Y_1$, $\alpha = b$, $\beta = c$ and $\tau$ swaps entries $p_{12}$ and $p_{23}$ (i.e. $\alpha$ and $\beta$), yielding $P^\tau = Y_2$. Using the definitions
we see that $r^\tau_\alpha = a + c + f$, $c^\tau_\alpha = c + d$, $r_\beta = c + d + e$ and $c_\beta = c + f$. So indeed $c_\beta$ is always the minimal value of
these four, and we see that in addition the hypothesis that $r_\beta + c_\beta \leq r^\tau_\alpha + c^\tau_\alpha$ is satisfied. So $P \triangleright P^\tau$ by proposition 7.

(iii) This time let $P = Y_4$, $\alpha = d$, $\beta = e$ and $\tau$ swaps entries $p_{12}$ and $p_{21}$, yielding $P^\tau = Y_3$. Again using the
definitions: $r^\tau_\alpha = b + c + e$, $c^\tau_\alpha = a + e$, $r_\beta = a + e + f$ and $c_\beta = c + e$. So once again $c_\beta$ is always the minimal value
of these four, and the hypothesis that $r_\beta + c_\beta \leq r^\tau_\alpha + c^\tau_\alpha$ is again satisfied. So $P \triangleright P^\tau$ by proposition 7.

(iv) Now $P = Y_3$, $\alpha = e$, $\beta = f$ and $\tau$ swaps entries $p_{13}$ and $p_{21}$, yielding $P^\tau = X$. We have $r^\tau_\alpha = b + c + f$,
$c^\tau_\alpha = a + f$, $r_\beta = a + d + f$ and $c_\beta = b + f$. Again $c_\beta$ is always the minimal value of these four, and the hypothesis
that $r_\beta + c_\beta \leq r^\tau_\alpha + c^\tau_\alpha$ is satisfied. So $P \triangleright P^\tau$ by proposition 7.

(ii) Finally the slightly trickier case of proving $P = Y_2 \triangleright X$. The reason this is different is that on the face of it, it
does not consist of a single transposition but rather of a product of two disjoint transpositions $(p_{12}, p_{22})(p_{13}, p_{21})$ for
which the intermediate matrices have no a priori relations with one another. However we may use the $\triangleright$-relational
framework above if we observe that in fact the single transposition $\tau = (p_{11}, p_{23})$ with $\alpha = a$ and $\beta = b$ gives
$P^\tau = \begin{pmatrix} b & c & f \\ e & d & a \end{pmatrix}$, which is seen to have CMI equal to that of $X$. Since there is nothing in the definition of $\triangleright$ which
requires a matrix to be in $\mathcal{R}_{2\times 3}$ (this is merely a convenient classification for keeping track of them), we may apply
the same techniques as in (i), (iii) and (iv) to conclude that $r^\tau_\alpha = b + c + f$, $c^\tau_\alpha = b + e$, $r_\beta = b + d + e$ and $c_\beta = b + f$:
so $c_\beta$ is always the minimal value and since $b + f + b + d + e \leq b + c + f + b + e$ we are again able to apply proposition 7
to conclude that the quantity $I(X) - I(Y_2) = I(P^\tau) - I(P)$ is non-negative, as required.

This completes the proof of theorem 1. □
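The four applications of proposition 7 can be replayed numerically. The sketch below is illustrative only: each case lists the starting matrix by letters, then the zero-based positions of $\alpha$ and $\beta$, as read off from steps (i) to (iv). For random samples with $a > b > \dots > f$ it checks that the minimum of the four quantities in (17) is attained by $r_\beta$ or $c_\beta$, that $r_\beta + c_\beta \leq r^\tau_\alpha + c^\tau_\alpha$, and that the transposition does not decrease the CMI. All names are assumptions.

```python
import math
import random


def H(x):
    """Entropy term H(x) = -x log x, with H(0) = 0."""
    return 0.0 if x <= 0.0 else -x * math.log(x)


def cmi(P):
    """Classical mutual information of a probability matrix (list of rows)."""
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(H(p) for r in P for p in r)


# (rows of P by letter, position of alpha, position of beta), zero-based.
CASES = {
    "i":   (("abf", "edc"), (0, 1), (1, 2)),   # Y1, swap b and c
    "ii":  (("acf", "edb"), (0, 0), (1, 2)),   # Y2, swap a and b
    "iii": (("aef", "dcb"), (1, 0), (0, 1)),   # Y4, swap d and e
    "iv":  (("adf", "ecb"), (1, 0), (0, 2)),   # Y3, swap e and f
}


def check_case(layout, pos_alpha, pos_beta, value, tol=1e-12):
    """Verify the hypotheses of proposition 7 and its conclusion for one sample."""
    P = [[value[ch] for ch in row] for row in layout]
    (i, j), (k, l) = pos_alpha, pos_beta
    t = P[i][j] - P[k][l]                        # alpha - beta
    r_beta, c_beta = sum(P[k]), sum(row[l] for row in P)
    r_tau_alpha = sum(P[i]) - t                  # row of alpha after the swap
    c_tau_alpha = sum(row[j] for row in P) - t   # column of alpha after the swap
    Q = [row[:] for row in P]
    Q[i][j], Q[k][l] = Q[k][l], Q[i][j]
    return (t >= 0.0
            and min(r_beta, c_beta) <= min(r_tau_alpha, c_tau_alpha) + tol
            and r_beta + c_beta <= r_tau_alpha + c_tau_alpha + tol
            and cmi(Q) >= cmi(P) - tol)


def check_all(trials=1000, seed=3):
    """Run every case on random ordered probability vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        raw = sorted((rng.random() for _ in range(6)), reverse=True)
        total = sum(raw)
        value = dict(zip("abcdef", (x / total for x in raw)))
        if not all(check_case(*case, value) for case in CASES.values()):
            return False
    return True
```

In case (ii) the swapped matrix is not $X$ itself but a matrix with the same row and column sums, so its CMI coincides with that of $X$.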

Remark. It is worth pointing out that one may arrive at the conclusion of theorem 1 by a process of heuristic
reasoning, as follows. Recall from definition 1 that the CMI consists of three components, of which the last one
is identical for all matrices which are permutations of one another. So in order to understand maxima/minima
we restrict our focus to the first two terms, namely the entropies of the marginal probability vectors. Now entropy
is a measure of the "randomness" of the marginal probabilities: the more uniform they are the higher will be the
contribution to the CMI from these row and column sum vectors. Beginning with the columns, if we look at the a
priori ordering $a > b > c > d > e > f$ it is evident that the most uniform way of selecting pairs in general so as to
be as close as possible to one another would be to begin at the outside and work our way in: namely the column sum
vector should read $(a + f, b + e, c + d)$. Similarly for the row sums: we need to add small terms to $a$, but the position
of $f$ is already taken in the same column as $a$, so that just leaves $d$ and $e$ in the top row, and $c$ and $b$ fill up the bottom
row in the order dictated by the column sums. We perform a similar analysis for the simpler case of $2 \times 2$ matrices in
appendix B where in fact we can achieve a total ordering by the same method.
Appendix A: The matrix class representatives in $\mathcal{R}_{2\times 3}$

We list the matrix representatives in $\mathcal{R}_{2\times 3}$ in lexicographic order together with the enumeration we have used
throughout the paper when referring to them, alongside in each case the element $\sigma \in G = S_6$ in cycle notation
which represents the appropriate permutation of the fiducial matrix $\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}$ which we have chosen to represent
the identity $() \in G$. Note that each $\sigma$ is only chosen up to row- and column-swaps. Also, since we have chosen to
represent the matrices with $a$ in the top left-hand corner and with decreasing top row, the set of representative cycles
displayed is effectively a copy of $S_5$ modulo a subgroup of order 2.
Appendix B: The 2x2 case

We set out here a detailed proof of the phenomenon of maximal and minimal CMI in the case of $2 \times 2$ matrices,
which was first proven in [12] and [13]. We adopt a quite different approach, more direct in some sense than going
via majorisation theory, because it gives some insight into what is really going on in the $2 \times 3$ case.
Let $a > b > c > d > 0$ with $a + b + c + d = 1$ and let $M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ be the corresponding $2 \times 2$ probability
matrix. Define the (classical) mutual information $I(M)$ as before (though in a slightly different but equivalent form
to definition 1) to be

$$
I(M) = h(a + b) + h(a + c) - \sum_{x = a, b, c, d} -x \log x,
$$

where $h(x)$ is the standard binary entropy function

$$
h(x) = -x \log x - (1 - x)\log(1 - x) = H(x) + H(1 - x),
$$

defined for $x \in [0, 1]$. Again, we wish to establish whether there is a permutation of the elements of $M$ which gives us
a priori the minimal or maximal possible mutual information.

In analogy with the $2 \times 3$ situation above, we may consider the action of $S_4$ on the matrix $M$, denoting the places of
the matrix by $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$. Denote by $\mathcal{M}_{2\times 2}$ the space of all 24 possible permutations of the matrix $M$ (for a fixed choice
of $a, b, c, d$). We may observe once again that the CMI of $M$ is invariant under a large subgroup $J$ of $S_4$. Namely
it is unchanged if we swap the rows, swap the columns or transpose the matrix. Given our choice of numbering the
generators of our subgroup of $S_4$ are canonically the elements $(1,3)(2,4)$, $(1,2)(3,4)$ and $(2,3)$ respectively. The row
and column swap operations commute with one another, but the transpose operation $(2,3)$ causes our subgroup $J$ to
be a copy of the dihedral group $D_8$. Explicitly:

$$
J = \{(), (1,2,4,3), (1,4)(2,3), (1,3,4,2), (1,2)(3,4), (1,3)(2,4), (1,4), (2,3)\},
$$

and so we may write the right coset space as

$$
S_4 / J = \{()J, (2,4)J, (3,4)J\},
$$

viewing the action of $S_4$ as being via right translation, which in turn corresponds to a set of matrix representatives
of each class respectively as

$$
M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad
M^{(2,4)} = \begin{pmatrix} a & d \\ c & b \end{pmatrix} \quad \text{and} \quad
M^{(3,4)} = \begin{pmatrix} a & b \\ d & c \end{pmatrix}.
$$
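The group-theoretic bookkeeping here is small enough to verify by enumeration. The sketch below encodes a permutation of the places $\{1, 2, 3, 4\}$ as a tuple of images (an encoding chosen for illustration, along with all helper names), generates the subgroup from the row swap, column swap and transpose, and counts the cosets $Jg$.

```python
from itertools import permutations


def from_cycles(cycles, n=4):
    """Permutation of {1..n} as a tuple of images, built from disjoint cycles."""
    image = list(range(1, n + 1))
    for cycle in cycles:
        for idx, point in enumerate(cycle):
            image[point - 1] = cycle[(idx + 1) % len(cycle)]
    return tuple(image)


def compose(p, q):
    """Composite permutation that applies q first and then p."""
    return tuple(p[q[i] - 1] for i in range(len(q)))


def generate(gens):
    """Closure of a finite set of generators under composition."""
    identity = tuple(range(1, len(gens[0]) + 1))
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in gens:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


row_swap = from_cycles([(1, 3), (2, 4)])
col_swap = from_cycles([(1, 2), (3, 4)])
transpose = from_cycles([(2, 3)])

J = generate([row_swap, col_swap, transpose])

# Each coset is stored as a frozenset so that equal cosets collapse together.
cosets = {frozenset(compose(j, g) for j in J) for g in permutations(range(1, 5))}
```

With 24 permutations and a subgroup of order 8 there are exactly three cosets, matching the three matrix representatives listed above; the transpositions $(2,4)$ and $(3,4)$ lie outside $J$ and in different cosets.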



In the 2x3 case we were able to find a unique maximum and 5 possible minima. In this much simpler case we can 
in fact order all three right coset classes a priori. 

Proposition 12. With notation as above,

$$
I(M) \leq I(M^{(3,4)}) \leq I(M^{(2,4)}).
$$

Remark. Recall the remark on page 20 after the proof of theorem 1: it is possible to arrive at the conclusion of
proposition 12 by heuristic reasoning as follows. Following the method there we focus solely on the entropies of the
marginal probability vectors (the row and column sum vectors in the text). Uniformity in these will yield higher
entropies, hence for the maximum we should seek to have the row and column sum vectors each as near to $(0.5, 0.5)$
as possible. Clearly this will occur in general when we add $a$ to $d$ and $b$ to $c$; however this cannot occur for both rows
and columns so the next best thing is to have $a + c$ and $b + d$. Hence the maximal CMI will occur for the matrix $\begin{pmatrix} a & d \\ c & b \end{pmatrix}$.
By similar reasoning the minimum must occur for the least uniform sums, namely $a + b$ with $c + d$ and then $a + c$
and $b + d$, leading to the minimal CMI occurring for $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$. This leaves the middle value for the remaining matrix,
which has of course the maximum-entropy set $(a + d, b + c)$ together with the minimum-entropy set $(a + b, c + d)$.

Proof. From the discussion above it follows that the function $I$ takes on at most three distinct values on the orbit
of $M$ under the action of $S_4$. These three values are $I(M)$, $I(M^{(2,4)})$ and $I(M^{(3,4)})$. So we have in the notation
introduced above,

$$
I(M^{(2,4)}) = h(a + d) + h(a + c) - \sum_{x = a, b, c, d} -x \log x,
$$

and

$$
I(M^{(3,4)}) = h(a + b) + h(a + d) - \sum_{x = a, b, c, d} -x \log x.
$$

Hence in order to prove the proposition we may simply consider the differences

$$
I(M^{(3,4)}) - I(M) = h(a + d) - h(a + c)
$$

and

$$
I(M^{(2,4)}) - I(M^{(3,4)}) = h(a + c) - h(a + b).
$$

The claim of the proposition is that both of these quantities are non-negative.

Here we need three basic properties of the function $h(x)$ on its domain of definition (the unit interval), namely that
it is continuous, symmetric about the line $x = \frac{1}{2}$ and monotonic decreasing on either side of that line (moving always
in the direction away from the central point of course). Note that this is weaker than needing concavity and maxima
from calculus.

Given these three conditions, the size of $h(x)$ versus $h(y)$ for $x, y \in [0, 1]$ is measured precisely by how close each of
$x$ and $y$ is to the point $x = \frac{1}{2}$. In other words, if $|x - \frac{1}{2}| \leq |y - \frac{1}{2}|$ then $h(x) \geq h(y)$. Hence we are reduced to showing
that

$$
\left|a + d - \tfrac{1}{2}\right| \leq \left|a + c - \tfrac{1}{2}\right| \leq \left|a + b - \tfrac{1}{2}\right|. \tag{B1}
$$

By the ordering $a > b > c > d$ and the fact that $a + b + c + d = 1$, both

$$
a + b \geq \tfrac{1}{2}, \qquad a + c \geq \tfrac{1}{2}.
$$

Notice that $a + d$ may be either side of $\frac{1}{2}$; however

$$
\left|a + d - \tfrac{1}{2}\right| = \tfrac{1}{2}\left|2a + 2d - 1\right| = \tfrac{1}{2}\left|a + d - b - c\right| \leq \left|\frac{a - c}{2}\right| + \left|\frac{b - d}{2}\right| = \frac{a - c}{2} + \frac{b - d}{2} \tag{B2}
$$

by the triangle inequality, noting for the last equality that $a > c$ and $b > d$ by assumption. By symmetry we may also
write this as:

$$
\left|a + d - \tfrac{1}{2}\right| \leq \frac{a - b}{2} + \frac{c - d}{2}. \tag{B3}
$$

Now using the facts about $a, b, c, d$ once again we have:

$$
\left|a + c - \tfrac{1}{2}\right| = a + c - \tfrac{1}{2} = \tfrac{1}{2}(2a + 2c - 1) = \tfrac{1}{2}(a + c - b - d) = \frac{a - b}{2} + \frac{c - d}{2} \tag{B4}
$$

and

$$
\left|a + b - \tfrac{1}{2}\right| = a + b - \tfrac{1}{2} = \tfrac{1}{2}(2a + 2b - 1) = \tfrac{1}{2}(a + b - c - d) = \frac{a - c}{2} + \frac{b - d}{2}, \tag{B5}
$$

which combined with (B2) and (B3) prove directly that $|a + d - \frac{1}{2}| \leq |a + c - \frac{1}{2}|$ and that $|a + d - \frac{1}{2}| \leq |a + b - \frac{1}{2}|$.

But it is clear moreover from the fact that $a > b > c > d$ that (B4) $\leq$ (B5), which proves the rest of the inequality
in (B1). □
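As a numerical companion to proposition 12 and inequality (B1), the following sketch draws random ordered probability vectors $(a, b, c, d)$ and checks both the chain of distances from $\frac{1}{2}$ and the resulting ordering of the three CMI values. The function names and tolerance are illustrative assumptions.

```python
import math
import random


def H(x):
    """Entropy term H(x) = -x log x, with H(0) = 0."""
    return 0.0 if x <= 0.0 else -x * math.log(x)


def cmi(P):
    """Classical mutual information of a probability matrix (list of rows)."""
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    return sum(map(H, rows)) + sum(map(H, cols)) - sum(H(p) for r in P for p in r)


def check_proposition12(trials=5000, seed=4, tol=1e-12):
    """Check (B1) and I(M) <= I(M^(3,4)) <= I(M^(2,4)) on random ordered samples."""
    rng = random.Random(seed)
    for _ in range(trials):
        raw = sorted((rng.random() for _ in range(4)), reverse=True)
        total = sum(raw)
        a, b, c, d = (x / total for x in raw)
        # Distances from 1/2, as in (B1).
        d_ad, d_ac, d_ab = abs(a + d - 0.5), abs(a + c - 0.5), abs(a + b - 0.5)
        if not (d_ad <= d_ac + tol and d_ac <= d_ab + tol):
            return False
        i_m = cmi([[a, b], [c, d]])
        i_34 = cmi([[a, b], [d, c]])   # places 3 and 4 swapped
        i_24 = cmi([[a, d], [c, b]])   # places 2 and 4 swapped
        if not (i_m <= i_34 + tol and i_34 <= i_24 + tol):
            return False
    return True
```

Because the joint entropy term is common to all three matrices, the comparison is driven entirely by the binary entropies of the marginals, exactly as in the proof.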